    1. eLife Assessment

      This study, from the group that pioneered migrasome research, describes a novel vaccine platform of engineered migrasomes that behave like natural migrasomes. Importantly, this platform has the potential to overcome the cold-chain obstacles associated with vaccines such as mRNA vaccines. In the revised version, the authors have addressed previous concerns, and the results from additional experiments provide compelling evidence, with methods, data, and analyses more rigorous than the current state-of-the-art. Although the findings are important, with practical implications for vaccine technology, results from additional experiments would make this an outstanding study.

    2. Reviewer #1 (Public review):

      Summary:

      Outstanding fundamental phenomenon (migrasomes) en route to becoming translationally highly significant.

      Strengths:

      Innovative approach at several levels: migrasomes, discovered by Dr. Yu's group, are an outstanding biological phenomenon of fundamental interest and now of potentially practical value.

      Weaknesses:

      I feel that the overemphasis on practical aspects (vaccine), however important, eclipses some of the fundamental aspects that may be just as important and actually more interesting. If this can be expanded, the study would be outstanding.

      Comments on revisions: This reviewer feels that the authors have addressed all issues.

    3. Reviewer #2 (Public review):

      Summary:

      The authors' report describes a novel vaccine platform derived from a newly discovered organelle called a migrasome. First, the authors address a technical hurdle for using migrasomes as a vaccine platform. Natural migrasome formation occurs at low levels and is labor intensive; however, by understanding the molecular underpinning of migrasome formation, the authors have designed a method to make engineered migrasomes from cultured cells at higher yields utilizing a robust process. These engineered migrasomes behave like natural migrasomes. Next, the authors immunized mice with migrasomes that expressed either a model peptide or the SARS-CoV-2 spike protein. Antibodies against the spike protein were raised that could be boosted by a second vaccination, and these antibodies were functional as assessed by an in vitro pseudoviral assay. This new vaccine platform has the potential to overcome obstacles such as cold chain issues for vaccines like messenger RNA that require very stringent storage conditions.

      Strengths:

      The authors present very robust studies detailing the biology behind migrasome formation, and this fundamental understanding was used to form engineered migrasomes, which makes it possible to utilize migrasomes as a vaccine platform. The characterization of engineered migrasomes is thorough and establishes comparability with naturally occurring migrasomes. The biophysical characterization of the migrasomes is well done, including thermal stability and characterization of the particle size (important characterizations for a good vaccine).

      Weaknesses:

      With a new vaccine platform technology, it would be nice to compare them head-to-head against a proven technology. The authors would improve the manuscript if they made some comparisons to other vaccine platforms such as a SARS-CoV-2 mRNA vaccine or even an adjuvanted recombinant spike protein. This would demonstrate a migrasome based vaccine could elicit responses comparable to a proven vaccine technology. Additionally, understanding the integrity of the antigens expressed in their migrasomes could be useful. This could be done by looking at functional monoclonal antibody binding to their migrasomes in a confocal microscopy experiment.

      Updates after revision:

      The revised manuscript has additional experiments that I believe improve the strength of evidence presented in the manuscript and address the weaknesses of the first draft. First, they compare the antibody responses induced by their migrasome-based platform with those induced by recombinant protein formulated in an adjuvant and show that the responses are comparable. Second, they provide evidence that the spike protein incorporated into their migrasomes retains structural integrity by preserving binding to monoclonal antibodies. Together, these results strengthen the paper significantly and support the claim that the novel migrasome-based vaccine platform could be useful in the vaccine development field.

    4. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This is an excellent study by a superb investigator who discovered and is championing the field of migrasomes. This study contains a hidden "gem" - the induction of migrasomes by hypotonicity and how that happens. In summary, an outstanding fundamental phenomenon (migrasomes) en route to becoming translationally highly significant.

      Strengths:

      Innovative approach at several levels. Migrasomes - discovered by Dr Yu's group - are an outstanding biological phenomenon of fundamental interest and now of potentially practical value.

      Weaknesses:

      I feel that the overemphasis on practical aspects (vaccine), however important, eclipses some of the fundamental aspects that may be just as important and actually more interesting. If this can be expanded, the study would be outstanding.

      We sincerely thank the reviewer for the encouraging and insightful comments. We fully agree that the fundamental aspects of migrasome biology are of great importance and deserve deeper exploration.

      In line with the reviewer’s suggestion, we have expanded our discussion on the basic biology of engineered migrasomes (eMigs). A recent study by the Okochi group at the Tokyo Institute of Technology demonstrated that hypoosmotic stress induces the formation of migrasome-like vesicles, involving cytoplasmic influx and requiring cholesterol for their formation (DOI: 10.1002/1873-3468.14816, February 2024). Building on this, our study provides a detailed characterization of hypoosmotic stress-induced eMig formation, and further compares the biophysical properties of natural migrasomes and eMigs. Notably, the inherent stability of eMigs makes them particularly promising as a vaccine platform.

      Finally, we would like to note that our laboratory continues to investigate multiple aspects of migrasome biology. In collaboration with our colleagues, we recently completed a study elucidating the mechanical forces involved in migrasome formation (DOI: 10.1016/j.bpj.2024.12.029), which further complements the findings presented here.

      Reviewer #2 (Public review):

      Summary:

      The authors' report describes a novel vaccine platform derived from a newly discovered organelle called a migrasome. First, the authors address a technical hurdle in using migrasomes as a vaccine platform. Natural migrasome formation occurs at low levels and is labor intensive; however, by understanding the molecular underpinning of migrasome formation, the authors have designed a method to make engineered migrasomes from cultured cells at higher yields utilizing a robust process. These engineered migrasomes behave like natural migrasomes. Next, the authors immunized mice with migrasomes that expressed either a model peptide or the SARS-CoV-2 spike protein. Antibodies against the spike protein were raised that could be boosted by a second vaccination, and these antibodies were functional as assessed by an in vitro pseudoviral assay. This new vaccine platform has the potential to overcome obstacles such as cold chain issues for vaccines like messenger RNA that require very stringent storage conditions.

      Strengths:

      The authors present very robust studies detailing the biology behind migrasome formation and this fundamental understanding was used to form engineered migrasomes, which makes it possible to utilize migrasomes as a vaccine platform. The characterization of engineered migrasomes is thorough and establishes comparability with naturally occurring migrasomes. The biophysical characterization of the migrasomes is well done including thermal stability and characterization of the particle size (important characterizations for a good vaccine).

      Weaknesses:

      With a new vaccine platform technology, it would be nice to compare them head-to-head against a proven technology. The authors would improve the manuscript if they made some comparisons to other vaccine platforms such as a SARS-CoV-2 mRNA vaccine or even an adjuvanted recombinant spike protein. This would demonstrate a migrasome-based vaccine could elicit responses comparable to a proven vaccine technology.

      We thank the reviewer for the thoughtful evaluation and constructive suggestions, which have helped us strengthen the manuscript. 

      Comparison with proven vaccine technologies:

      In response to the reviewer’s comment, we now include a direct comparison of the antibody responses elicited by eMig-Spike and a conventional recombinant S1 protein vaccine formulated with Alum. As shown in the revised manuscript (Author response image 1), the levels of S1-specific IgG induced by the eMig-based platform were comparable to those induced by the S1+Alum formulation. This comparison supports the potential of eMigs as a competitive alternative to established vaccine platforms. 

      Author response image 1.

      eMigrasome-based vaccination showed similar efficacy compared with adjuvanted recombinant spike protein. The amount of S1-specific IgG in mouse serum was quantified by ELISA on day 14 after immunization. Mice were either intraperitoneally (i.p.) immunized with recombinant Alum/S1 or intravenously (i.v.) immunized with eM-NC, eM-S or recombinant S1. The administered doses were 20 µg/mouse for eMigrasomes, 10 µg/mouse (i.v.) or 50 µg/mouse (i.p.) for recombinant S1, and 50 µl/mouse for Aluminium adjuvant.

      Assessment of antigen integrity on migrasomes:

      To address the reviewer’s suggestion regarding antigen integrity, we performed immunoblotting using antibodies against both S1 and mCherry. Two distinct bands were observed: one at the expected molecular weight of the S-mCherry fusion protein, and a higher molecular weight band that may represent oligomerized or higher-order forms of the Spike protein (Figure 5b in the revised manuscript).

      Furthermore, we performed confocal microscopy using a monoclonal antibody against Spike (anti-S). Co-localization analysis revealed strong overlap between the mCherry fluorescence and anti-Spike staining, confirming the proper presentation and surface localization of intact S-mCherry fusion protein on eMigs (Figure 5c in the revised manuscript). These results confirm the structural integrity and antigenic fidelity of the Spike protein expressed on eMigs.

      Recommendations for the authors

      Reviewer #1 (Recommendations For The Authors):

      I feel that the overemphasis on practical aspects (vaccine), however important, eclipses some of the fundamental aspects that may be just as important and actually more interesting. If this can be expanded, the study would be outstanding.

      I know that the reviewers always ask for more, and this is not the case here. Can the abstract and title be changed to emphasize the science behind migrasome formation, and possibly add a few more fundamental aspects on how hypotonic shock induces migrasomes?

      Alternatively, if the authors desire to maintain the emphasis on vaccines, can immunological mechanisms be somewhat expanded in order to - at least to some extent - explain why migrasomes are a better vaccine vehicle?

      One way or another, this reviewer is highly supportive of this study and it is really up to the authors and the editor to decide whether my comments are of use or not.

      My recommendation is to go ahead with publishing after some adjustments as per above.

      We’d like to thank the reviewer for the suggestion. We have changed the title of the manuscript and modified the abstract, emphasizing the fundamental science behind the development of eMigrasomes. To gain some immunological information on eMig-elicited antibody responses, we characterized the type of IgG induced by eM-OVA in mice, and compared it to that induced by Alum/OVA. The IgG response to Alum/OVA was dominated by IgG1. In contrast, eM-OVA induced an even distribution of IgG subtypes, including IgG1, IgG2b, IgG2c, and IgG3 (Figure 4i in the revised manuscript). The ratio between IgG1 and IgG2a/c indicates a Th1 or Th2 type humoral immune response. Thus, eM-OVA immunization induces a balance of Th1/Th2 immune responses.

      Reviewer #2 (Recommendations For The Authors):

      The study is a very nice exploration of a new vaccine platform. This reviewer believes that a more direct head-to-head comparison with the current SARS-CoV-2 vaccine platform would improve the manuscript. This comparison is done with OVA antigen, but this model antigen is not as exciting as a functional head-to-head with a SARS-CoV-2 vaccine.

      I think that two other discussion points should be included in the manuscript. First, was the host-cell protein evaluated? If not, I would include that point on how issues of host cell contamination of the migrasome could play a role in the responses and safety of a vaccine. Second, I would discuss antigen incorporation and localization into the platform. For example, the full-length spike being expressed has a native signal peptide and transmembrane domain. The authors point out that a transmembrane domain can be added to display an antigen that does not have one natively expressed, however, without a signal peptide this would not be secreted and localized properly. I would suggest adding a discussion of how a non-native signal peptide would be necessary in addition to a transmembrane domain.

      We thank the reviewer for these thoughtful suggestions and fully agree that the points raised are important for the translational development of eMig-based vaccines.

      (1) Host cell proteins and potential immunogenicity:

      We appreciate the reviewer’s suggestion to consider host cell protein contamination. Considering potential clinical application of eMigrasomes in the future, we will use human cells with low immunogenicity such as HEK-293 or embryonic stem cells (ESCs) to generate eMigrasomes. Also, we will follow a QC that meets the standard of validated EV-based vaccination techniques. 

      (2) Antigen incorporation and localization—signal peptide and transmembrane domain:

      We also agree with the reviewer’s point that proper surface display of antigens on eMigs requires both a transmembrane domain and a signal peptide for correct trafficking and membrane anchoring. For instance, in the case of full-length Spike protein, the native signal peptide and transmembrane domain ensure proper localization to the plasma membrane and subsequent incorporation into eMigs. In the case of OVA, a secretory protein that contains a native signal peptide yet lacks a transmembrane domain, an engineered transmembrane domain is required. For antigens that do not naturally contain these features, both a non-native signal peptide and an artificial transmembrane domain are necessary. We have clarified this point in the revised discussion and explicitly noted the requirement for a signal peptide when engineering antigens for surface display on migrasomes.

    1. eLife Assessment

      This paper reports the fundamental finding of how Raman spectral patterns correlate with proteome profiles, using Raman spectra of E. coli cells from different physiological conditions, and identifies global stoichiometric regulation of proteomes. The authors' findings provide compelling evidence, through analysis of both bacterial and human cells, that stoichiometric regulation of proteomes is general. In the future, similar methodology can be applied to various tissue types and microbial species for studying proteome composition with Raman spectral patterns.

    2. Reviewer #1 (Public review):

      Summary

      This work performed Raman spectral microscopy for E. coli cells under 15 different culture conditions. The authors developed a theoretical framework to construct a regression matrix that predicts proteome composition from Raman data. Specifically, this regression matrix is obtained by statistical inference from various experimental conditions. With this model, the authors categorized co-expressed genes and illustrated how proteome stoichiometry is regulated among different culture conditions. Co-expressed gene clusters were investigated and identified as homeostasis core, carbon-source dependent, and stationary phase dependent genes. Overall, the authors demonstrate a strong and comprehensive data analysis scheme for the joint analysis of Raman and proteome datasets.

      Strengths and major contributions

      Major contributions: (1) Experimentally, the authors contributed Raman datasets of E. coli with various growth conditions. (2) In data analysis, the authors developed a scheme to compare proteome and Raman datasets. Protein co-expression clusters were identified, and their biological meaning was investigated.

      Discussion and impact for the field

      Raman signature contains both proteomic and metabolomic information and is an orthogonal method to infer the composition of biomolecules. This work is a strong initiative for introducing the powerful technique to systems biology and provides a rigorous pipeline for future data analysis. The regression matrix can be used for cross-comparison among future experimental results on proteome-Raman datasets.

      Comments on revisions:

      The authors addressed all my questions nicely. In particular, the subsampling test demonstrated that with enough "distinct" physiological conditions (even for m=5) one could already explore the major mode of proteome regulation and Raman signature. The main text has been streamlined and the clarity is improved. I have a minor suggestion:

      (i) For equation (1), it is important to emphasize that the formula works for every j=1,...,15, and the regression matrix B is obtained by statistical inference by summarizing data from all 15 conditions.
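      The point about a single B shared by all conditions can be illustrated with a minimal sketch. It assumes, purely for illustration, that equation (1) has the linear form p_j ≈ B r_j, where r_j is the Raman vector of condition j in LDA space and p_j is the corresponding proteome vector; the exact form of equation (1), the dimensions other than m = 15, and all data below are hypothetical stand-ins, not the authors' values or method.

```python
import numpy as np

rng = np.random.default_rng(0)

m = 15        # number of conditions (from the text)
d = m - 1     # assumed dimension of the LDA-space Raman vectors
n_prot = 200  # hypothetical number of protein species

# Synthetic stand-ins for real data; column j holds condition j.
R = rng.normal(size=(d, m))                            # Raman vectors r_j
B_true = rng.normal(size=(n_prot, d))                  # hidden linear map
P = B_true @ R + 0.01 * rng.normal(size=(n_prot, m))   # proteome vectors p_j

# One B for all conditions: minimize sum_j ||p_j - B r_j||^2,
# i.e. solve the least-squares problem R^T B^T = P^T.
B_hat = np.linalg.lstsq(R.T, P.T, rcond=None)[0].T

# The same B_hat is applied to every condition j = 1, ..., m.
residual = np.linalg.norm(P - B_hat @ R) / np.linalg.norm(P)
print(B_hat.shape, residual)
```

      Because B_hat is fitted to all 15 columns at once, no single condition determines it, which is the property the reviewer asks to be stated explicitly.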

    3. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary

      This work performed Raman spectral microscopy at the single-cell level for 15 different culture conditions in E. coli. The Raman signature is systematically analyzed and compared with the proteome dataset of the same culture conditions. With a linear model, the authors revealed correspondence between Raman pattern and proteome expression stoichiometry indicating that spectrometry could be used for inferring proteome composition in the future. With both Raman spectra and proteome datasets, the authors categorized co-expressed genes and illustrated how proteome stoichiometry is regulated among different culture conditions. Co-expressed gene clusters were investigated and identified as homeostasis core, carbon-source dependent, and stationary phase-dependent genes. Overall, the authors demonstrate a strong and solid data analysis scheme for the joint analysis of Raman and proteome datasets.

      Strengths and major contributions

      (1) Experimentally, the authors contributed Raman datasets of E. coli with various growth conditions.

      (2) In data analysis, the authors developed a scheme to compare proteome and Raman datasets. Protein co-expression clusters were identified, and their biological meaning was investigated.

      Weaknesses

      The experimental measurements of Raman microscopy were conducted at the single-cell level; however, the analysis was performed by averaging across the cells. The authors did not discuss whether Raman microscopy can be used to detect cell-to-cell variability under the same condition.

      We thank the reviewer for raising this important point. Though this topic is beyond the scope of our study, some of our authors have addressed the application of single-cell Raman spectroscopy to characterizing phenotypic heterogeneity in individual Staphylococcus aureus cells in another paper (Kamei et al., bioRxiv, doi: 10.1101/2024.05.12.593718). Additionally, one of our authors demonstrated that single-cell RNA sequencing profiles can be inferred from Raman images of mouse cells (Kobayashi-Kirschvink et al., Nat. Biotechnol. 42, 1726–1734, 2024). Therefore, detecting cell-to-cell variability under the same conditions has been shown to be feasible. Whether averaging single-cell Raman spectra is necessary depends on the type of analysis and the available dataset. We will discuss this in more detail in our response to Comment (1) by Reviewer #1 (Recommendation for the authors).

      Discussion and impact on the field

      Raman signature contains both proteomic and metabolomic information and is an orthogonal method to infer the composition of biomolecules. It has the advantage that single-cell level data could be acquired and both in vivo and in vitro data can be compared. This work is a strong initiative for introducing the powerful technique to systems biology and providing a rigorous pipeline for future data analysis.

      Reviewer #2 (Public review):

      Summary and strengths:

      Kamei et al. observe the Raman spectra of a population of single E. coli cells in diverse growth conditions. Using LDA, Raman spectra for the different growth conditions are separated. Using previously available protein abundance data for these conditions, a linear mapping from Raman spectra in LDA space to protein abundance is derived. Notably, this linear map is condition-independent and is consequently shown to be predictive for held-out growth conditions. This is a significant result and in my understanding extends the earlier Raman to RNA connection that has been reported earlier.

      They further show that this linear map reveals something akin to bacterial growth laws (à la Scott/Hwa): a certain collection of proteins shows stoichiometric conservation, i.e., the group (called SCG, stoichiometrically conserved group) maintains its stoichiometry across conditions while the overall scale depends on the conditions. Analysis of the changes in protein mass and Raman spectra under these conditions shows that the abundance ratios of information processing proteins (one of the large groups where many proteins belong to "information and storage" - ISP that is also identified as a cluster of orthologous proteins) remain constant. The mass of these proteins, deemed the homeostatic core, increases linearly with growth rate. Other SCGs and other proteins are condition-specific.

      Notably, beyond the ISP COG, the other SCGs were identified directly using the proteome data. Taking the analysis further, they then show how the centrality of a protein - roughly measured as how many proteins it is stoichiometric with - relates to function and evolutionary conservation. Again significant results, but I am not sure if these ideas have been reported earlier, for example from the community that built protein-protein interaction maps.

      As pointed out, past studies have revealed that the function, essentiality, and evolutionary conservation of genes are linked to the topology of gene networks, including protein-protein interaction networks. However, to the best of our knowledge, their linkage to stoichiometry conservation centrality of each gene has not yet been established.

      Previously analyzed networks, such as protein-protein interaction networks, depend on known interactions. Therefore, as our understanding of the molecular interactions evolves with new findings, the conclusions may change. Furthermore, analysis of a particular interaction network cannot account for effects from different types of interactions or multilayered regulations affecting each protein species.

      In contrast, the stoichiometry conservation network in this study focuses solely on expression patterns as the net result of interactions and regulations among all types of molecules in cells. Consequently, the stoichiometry conservation networks are not affected by the detailed knowledge of molecular interactions and naturally reflect the global effects of multilayered interactions. Additionally, stoichiometry conservation networks can easily be obtained for non-model organisms, for which detailed molecular interaction information is usually unavailable. Therefore, analysis with the stoichiometry conservation network has several advantages over existing methods from both biological and technical perspectives.

      We added a paragraph explaining this important point to the Discussion section, along with additional literature.

      Finally, the paper built a lot of "machinery" to connect the spaces Ω<sub>LE</sub>, built directly from the proteome, and Ω<sub>B</sub>, built from Raman. I am unsure how that helps and have not been able to digest the 50 or so pages devoted to this.

      The mathematical analyses in the supplementary materials form the basis of the argument in the main text. Without the rigorous mathematical discussions, Fig. 6E — one of the main conclusions of this study — and Fig. 7 could never be obtained. Therefore, we believe the analyses are essential to this study. However, we clarified why each analysis is necessary and significant in the corresponding sections of the Results to improve the manuscript's readability.

      Please see our responses to comments (2) and (7) by Reviewer #1 (Recommendations for the authors) and comments (5) and (6) by Reviewer #2 (Recommendations for the authors).

      Strengths:

      The rigorous analysis of the data is the real strength of the paper. Alongside this, the discovery of SCGs that are condition-independent and that are condition-dependent provides a great framework.

      Weaknesses:

      Overall, I think it is an exciting advance but some work is needed to present the work in a more accessible way.

      We edited the main text to make it more accessible to a broader audience. Please see our responses to comments (2) and (7) by Reviewer #1 (Recommendations for the authors) and comments (5) and (6) by Reviewer #2 (Recommendations for the authors).

      Reviewer #1 (Recommendations for the authors):

      (1) The Raman spectral data is measured from single-cell imaging. In the current work, most of the conclusions are from averaged data. From my understanding, once the correspondence between LDA and proteome data is established (i.e. the matrix B) one could infer the single-cell proteome composition from B. This would provide valuable information on how proteome composition fluctuates at the single-cell level.

      We can calculate single-cell proteomes from single-cell Raman spectra in the manner suggested by the reviewer. However, we cannot evaluate the accuracy of their estimation without single-cell proteome data under the same environmental conditions. Likewise, we cannot verify variations of estimated proteomes of single cells. Since quantitatively accurate single-cell proteome data is unavailable, we concluded that addressing this issue was beyond the scope of this study.

      Nevertheless, we agree with the reviewer that investigating how proteome composition fluctuates at the single-cell level based on single-cell Raman spectra is an intriguing direction for future research. In this regard, some of our authors have studied the phenotypic heterogeneity of Staphylococcus aureus cells using single-cell Raman spectra in another paper (Kamei et al., bioRxiv, doi: 10.1101/2024.05.12.593718), and one of our authors has demonstrated that single-cell RNA sequencing profiles can be inferred from Raman images of mouse cells (Kobayashi-Kirschvink et al., Nat. Biotechnol. 42, 1726–1734, 2024). Therefore, it is highly plausible that single-cell Raman spectroscopy can also characterize proteomic fluctuations in single cells. We have added a paragraph to the Discussion section to highlight this important point.

      (2) The establishment of matrix B is quite confusing for readers who only read the main text. I suggest adding a flow chart in Figure 1 to explain the data analysis pipeline, as well as state explicitly what is the dimension of B, LDA matrix, and proteome matrix.

      We thank the reviewer for the suggestion. Following the reviewer's advice, we have explicitly stated the dimensions of the vectors and matrices in the main text. We have also added descriptions of the dimensions of the constructed spaces. Rather than adding another flow chart to Figure 1, we added a new table (Table 1) to explain the various symbols representing vectors and matrices, thereby improving the accessibility of the explanation.

      (3) One of the main contributions of this work is to demonstrate how proteome stoichiometry is regulated across different conditions. A total of m=15 conditions were tested in this study, and this limits the rank of the LDA matrix to 14. Therefore, maximally 14 "modes" of differential composition in a proteome can be detected.
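      The rank bound mentioned here follows from a general property of LDA: the m class means, once centered on the grand mean, satisfy one linear constraint, so the between-class scatter has rank at most m - 1. A minimal numerical sketch, using synthetic data with hypothetical cell counts and feature counts (not the study's spectra):

```python
import numpy as np

rng = np.random.default_rng(1)

m = 15        # number of conditions (from the text)
n_per = 30    # hypothetical number of cells per condition
n_feat = 100  # hypothetical number of spectral features

# Synthetic "spectra": each condition gets its own random offset.
offsets = rng.normal(size=(m, n_feat))
X = rng.normal(size=(m * n_per, n_feat)) + np.repeat(offsets, n_per, axis=0)
labels = np.repeat(np.arange(m), n_per)

grand_mean = X.mean(axis=0)
class_means = np.stack([X[labels == k].mean(axis=0) for k in range(m)])

# Centered class means sum to zero (equal class sizes), so they span
# at most m - 1 dimensions; the between-class scatter shares this rank.
centered = class_means - grand_mean
rank = np.linalg.matrix_rank(centered)
print(rank)
```

      Since LDA axes are built from the between-class scatter, at most m - 1 discriminant axes carry information, regardless of how many spectral features are measured.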

      As a general reader, I am wondering in the future if one increases or decreases the number of conditions (say m=5 or m=50) what information can be extracted? It is conceivable that increasing different conditions with distinct cellular physiology would be beneficial to "explore" different modes of regulation for cells. As proof of principle, I am wondering if the authors could test a lower number (by sub-sampling from m=15 conditions, e.g. picking five of the most distinct conditions) and see how this would affect the prediction of proteome stoichiometry inference.

      We thank the reviewer for bringing an important point to our attention. To address the issue raised, we conducted a new subsampling analysis (Fig. S14).

      As we described in the main text (Fig. 6E) and the supplementary materials, the m x m orthogonal matrix, Θ, represents to what extent the two spaces Ω<sub>LE</sub> and Ω<sub>B</sub> are similar (m is the number of conditions; in our main analysis, m = 15). Thus, the low-dimensional correspondence between the two spaces connected by an orthogonal transformation, such as an m-dimensional rotation, can be evaluated by examining the elements of the matrix Θ. Specifically, large off-diagonal elements of the matrix Θ mix higher dimensions and lower dimensions, making the two spaces spanned by the first few major axes appear dissimilar. Based on this property, we evaluated the vulnerability of the low-dimensional correspondence between Ω<sub>LE</sub> and Ω<sub>B</sub> to the reduced number of conditions by measuring how close Θ was to the identity matrix when the analysis was performed on the subsampled datasets.

      In the new figure (Fig. S14), we first created all possible smaller condition sets by subsampling the conditions. Next, to evaluate the closeness between the matrix Θ and the identity matrix for each smaller condition set, we generated 10,000 random orthogonal matrices of the same size as Θ. We then evaluated the probability of obtaining a higher level of low-dimensional correspondence than that of the experimental data by chance (see section 1.8 of the Supplementary Materials). This analysis was already performed in the original manuscript for the non-subsampled case (m = 15) in Fig. S9C; the new analysis systematically evaluates the correspondence for the subsampled datasets.
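      The random-matrix comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the closeness score used here (squared diagonal weight in the leading k x k block) and the names `closeness_to_identity` and `null_probability` are hypothetical, since the actual measure is defined in section 1.8 of the Supplementary Materials. Random orthogonal matrices are drawn with the standard QR-based construction.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Draw an n x n orthogonal matrix (Haar-distributed) via QR."""
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    # Fix the sign of each column so the distribution is uniform.
    return q * np.sign(np.diag(r))

def closeness_to_identity(theta, k=3):
    """Hypothetical score: squared diagonal weight in the leading k x k block."""
    return float(np.sum(np.diag(theta)[:k] ** 2))

def null_probability(theta, n_draws=10_000, k=3, seed=0):
    """Fraction of random orthogonal matrices scoring at least as high as theta."""
    rng = np.random.default_rng(seed)
    observed = closeness_to_identity(theta, k)
    n = theta.shape[0]
    hits = sum(
        closeness_to_identity(random_orthogonal(n, rng), k) >= observed
        for _ in range(n_draws)
    )
    return hits / n_draws
```

      Calling `null_probability(theta)` on a matrix estimated from a subsampled condition set would give a chance probability of the kind reported in the response; a small value means that random rotations rarely preserve the leading axes as well as the data do.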

      The results clearly show that low-dimensional correspondence is more likely to be obtained with more conditions (Fig. S14). In particular, when the number of conditions used in the analysis exceeds five, the median of the probability that random orthogonal matrices were closer to the identity matrix than the matrix Θ calculated from subsampled experimental data became lower than 10<sup>-4</sup>. This analysis provides insight into the number of conditions required to find low-dimensional correspondence between Ω<sub>LE</sub> and Ω<sub>B</sub>.

      The choice of conditions used in the analysis can change the low-dimensional structures of Ω<sub>LE</sub> and Ω<sub>B</sub>. Therefore, it is important to clarify whether including more conditions in the analysis reduces the dependence of the low-dimensional structures on conditions. We leave this issue as a subject for future study. This issue relates to the effective dimensionality of omics profiles needed to establish the diverse physiological states of cells across conditions. Determining the minimum number of conditions to attain the condition-independent low-dimensional structures of Ω<sub>LE</sub> and Ω<sub>B</sub> would provide insight into this fundamental problem. Furthermore, such an analysis would identify the range of applications of Raman spectra as a tool for capturing macroscopic properties of cells at the system level.

      We now discuss this point in the Discussion section, referring to this analysis result (Fig. S14). Please also see our reply to the comment (1) by Reviewer #2 (Recommendations for the authors).

      (4) In E. coli cells, the total proteome is at mM concentration, while total metabolites are between 10 and 100 mM. Since proteins are large molecules with more functional groups, they may contribute more Raman signal (per molecule) than metabolites. Still, the meaningful quantity here is the "differential Raman signal" between conditions, not the absolute signal. I am wondering what percentage of the differential Raman signature comes from the proteome and what percentage comes from the metabolome.

      It is an important and interesting question to what extent changes in the proteome and metabolome contribute to changes in Raman spectra. Though we concluded that answering this question is beyond the scope of this study, we believe it is an important topic for future research.

      Raman spectral patterns convey the comprehensive molecular composition spanning the various omics layers of target cells. Changes in the composition of these layers can be highly correlated, and identifying their contributions to changes in Raman spectra would provide insight into the mutual correlation of different omics layers. Addressing the issue raised by the reviewer would expand the applications of Raman spectroscopy and highlight the advantage of cellular Raman spectra as a means of capturing comprehensive multi-omics information.

      We note that some studies have evaluated the contributions of proteins, lipids, nucleic acids, and glycogen to the Raman spectra of mammalian cells and how these contributions change in different states (e.g., Mourant et al., J Biomed Opt, 10(3), 031106, 2005). Additionally, numerous studies have imaged or quantified metabolites in various cell types (see, for example, Cutshaw et al., Chemical Reviews, 123(13), 8297–8346, 2023, for a comprehensive review). Extending these approaches to multiple omics layers in future studies would help resolve the issue raised by the reviewer.

      (5) It is known that E. coli cells in different conditions have different cell sizes, where cell width increases with carbon source quality and growth rate. Is this effect normalized when processing the Raman signal?

      Each spectrum was normalized by subtracting the average and dividing it by the standard deviation. This normalization minimizes the differences in signal intensities due to different cell sizes and densities. This information is shown in the Materials and Methods section of the Supplementary Materials.
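
      A minimal sketch of this per-spectrum normalization, assuming that a spectrum is a list of intensities and that the population standard deviation is used (the response does not state which estimator was applied). The function name is hypothetical.

```python
import statistics


def normalize_spectrum(intensities):
    """Subtract the mean of the spectrum and divide by its standard deviation."""
    mean = statistics.fmean(intensities)
    std = statistics.pstdev(intensities)  # assumption: population standard deviation
    if std == 0:
        raise ValueError("a flat spectrum cannot be normalized")
    return [(x - mean) / std for x in intensities]
```

      With this normalization, two spectra that differ only by an overall intensity factor or a constant offset map to the same normalized spectrum, which is why signal differences that scale with cell size or density are reduced.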

      (6) I have a question about the interpretation of the centrality index. A higher centrality indicates the protein expression pattern is more aligned with the "mainstream" of the other proteins in the proteome. However, it is possible that the proteome has multiple "mainstream modes" (with possibly different contributions in magnitudes), and the centrality seems to only capture the "primary mode". A small group of proteins could all have low centrality but have very consistent patterns with high conservation of stoichiometry. I am wondering if the authors could discuss and clarify this.

      We thank the reviewer for drawing our attention to the insufficient explanation in the original manuscript. First, we note that stoichiometry-conserving protein groups are not limited to those composed of proteins with high stoichiometry conservation centrality. SCGs 2–5 are composed of proteins that strongly conserve stoichiometry within each group but have low stoichiometry conservation centrality (Fig. 5A, 5K, 5L, and 7A). In other words, our results demonstrate the existence of the "primary mainstream mode" (SCG 1, i.e., the homeostatic core) and condition-specific "non-primary mainstream modes" (SCGs 2–5). These primary and non-primary modes are distinguishable by their position along the axis of stoichiometry conservation centrality (Fig. 5A, 5K, and 5L).

      However, a single one-dimensional axis (centrality) cannot capture all characteristics of stoichiometry-conserving architecture. In our case, the "non-primary mainstream modes" (SCGs 2–5) were distinguished from each other by multiple csLE axes.

      To clarify this point, we modified the first paragraph of the section where we first introduce csLE (Revealing global stoichiometry conservation architecture of the proteomes with csLE). We also added a paragraph to the Discussion section regarding the condition-specific SCGs 2–5.

      (7) Figures 3, 4, and 5A-I are analyses on proteome data and are not related to Raman spectral data. I am wondering if this part of the analysis can be re-organized and not disrupt the mainline of the manuscript.

      We agree that the structure of this manuscript is complicated. Before submitting this manuscript to eLife, we seriously considered reorganizing it. However, we concluded that this structure was most appropriate because our focus on stoichiometry conservation cannot be explained without analyzing the coefficients of the Raman-proteome correspondence using COG classification (see Fig. 3; note that Fig. 3A relates to Raman data). This analysis led us to examine the global stoichiometry conservation architecture of proteomes (Figs. 4 and 5) and discover the unexpected similarity between the low-dimensional structures of Ω<sub>LE</sub> and Ω<sub>B</sub>.

      Therefore, we decided to keep the structure of the manuscript as it is. To partially resolve this issue, however, we added references to Fig. S1, the diagram of this paper’s mainline, to several places in the main text so that readers can more easily grasp the flow of the manuscript.

      (8) Supplementary Equation (2.6) could be wrong. From my understanding of the coordinate transformation definition here, it should be [w1 ... ws] X := RHS terms in big parenthesis.

      We checked the equation and confirmed that it is correct.

      Reviewer #2 (Recommendations for the authors):

      (1) The first main result or linear map between raman and proteome linked via B is intriguing in the sense that the map is condition-independent. A speculative question I have is if this relationship may become more complex or have more condition-dependent corrections as the number of conditions goes up. The 15 or so conditions are great but it is not clear if they are often quite restrictive. For example, they assume an abundance of most other nutrients. Now if you include a growth rate decrease due to nitrogen or other limitations, do you expect this to work?

      In our previous paper (Kobayashi-Kirschvink et al., Cell Systems 7(1): 104–117.e4, 2018), we statistically demonstrated a linear correspondence between cellular Raman spectra and transcriptomes for fission yeast under 10 environmental conditions. These conditions included nutrient-rich and nutrient-limited conditions, such as nitrogen limitation. Since the Raman-transcriptome correspondence was only statistically verified in that study, we analyzed the data from the standpoint of stoichiometry conservation in this study. The results (Fig. S11 and S12) revealed a correspondence in lower dimensions similar to that observed in our main results. In addition, similar correspondences were obtained even for different E. coli strains under common culture conditions (Fig. S11 and S12). Therefore, it is plausible that the stoichiometry-conservation low-dimensional correspondence between Raman and gene expression profiles holds for a wide range of external and internal perturbations.

      We agree with the reviewer that it is important to understand how Raman-omics correspondences change with the number of conditions. To address this issue, we examined how the correspondence between Ω<sub>LE</sub> and Ω<sub>B</sub> changes by subsampling the conditions used in the analysis. We focused on Θ, which was introduced in Fig. 5E, because the closeness of Θ to the identity matrix represents correspondence precision. We found a general trend that the low-dimensional correspondence becomes more precise as the number of conditions increases (Fig. S14). This suggests that increasing the number of conditions generally improves the correspondence rather than disrupting it.

      We added a paragraph to the Discussion section addressing this important point. Please also refer to our response to Comment (3) of Reviewer #1 (Recommendations for the authors).

      (2) A little more explanation in the text for 3C/D would help. I am imagining 3D is the control for 3C. Minor comment - 3B looks identical to S4F but the y-axis label is different.

      We thank the reviewer for pointing out the insufficient explanation of Fig. 3C and 3D in the main text. Following this advice, we added explanations of these plots to the main text. We also added labels ("ISP COG class" and "non-ISP COG class") to the top of these two figures.

      Fig. 3B and S4F are different. For simplicity, we used the Pearson correlation coefficient in Fig. 3B. However, cosine similarity is a more appropriate measure for evaluating the degree of conservation of abundance ratios. Thus, we presented the result using cosine similarity in a supplementary figure (Fig. S4F). Please note that each point in Fig. S4F is calculated between proteome vectors of two conditions. The dimension of each proteome vector is the number of genes in each COG class.
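
      The difference between the two measures can be illustrated with a small sketch. Pearson correlation is the cosine similarity computed after subtracting the mean of each vector, so it is insensitive to a constant offset, whereas cosine similarity equals 1 only when the two vectors are proportional, that is, when abundance ratios are exactly conserved. The vectors below are made-up values for illustration, not data from the study, and the function names are hypothetical.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson_correlation(x, y):
    """Pearson correlation, written as the cosine similarity of mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine_similarity([a - mean_x for a in x], [b - mean_y for b in y])


condition_a = [1.0, 2.0, 4.0]    # hypothetical abundances in one condition
proportional = [2.0, 4.0, 8.0]   # every abundance doubled: ratios conserved
shifted = [11.0, 12.0, 14.0]     # constant offset added: ratios not conserved
```

      Here `proportional` has a cosine similarity of 1 with `condition_a` (up to rounding), while `shifted` still has a Pearson correlation of 1 with `condition_a` but a cosine similarity below 1, because its abundance ratios differ.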

      (3) Can we see a log-log version of 4C to see how the low-abundant proteins are behaving? In fact, the same is in part true for Figure 3A.

      We added the semi-log version of the graph for SCG1 (the homeostatic core) in Fig. 4C to make low-abundant proteins more visible. Please note that the growth rates under the two stationary-phase conditions were zero; therefore, plotting this graph in log-log format is not possible.

      Fig. 3A cannot be shown as a log-log plot because many of the coefficients are negative. The insets in the graphs clarify the points near the origin.

      (4) In 5L, how should one interpret the other dots that are close to the center but not part of the SCG1? And this theme continues in 6ACD and 7A.

      The SCGs were obtained by setting a cosine similarity threshold. Therefore, proteins that are close to SCG 1 (the homeostatic core) but do not belong to it have a cosine similarity below the threshold with any protein in SCG 1. Fig. 7 illustrates the expression patterns of the proteins in question.

      (5) Finally, I do not fully appreciate the whole analysis of connecting Ω<sub>csLE</sub> and Ω<sub>B</sub> and plots in 6 and 7. This corresponds to a lot of linear algebra in the 50 or so pages in section 1.8 in the supplementary. If the authors feel this is crucial in some way it needs to be better motivated and explained. I philosophically appreciate developing more formalism to establish these connections but I did not understand how this (maybe even if in the future) could lead to a new interpretation or analysis or theory.

      The mathematical analyses included in the supplementary materials are important for readers who are interested in understanding the mathematics behind our conclusions. However, we also thought these arguments were too detailed for many readers when preparing the original submission and decided to show them in the supplemental materials.

      To better explain the motivation behind the mathematical analyses, we revised the section “Representing the proteomes using the Raman LDA axes”.

      Please also see our reply to the comment (6) by Reviewer #2 (Recommendations for the authors) below.

      (6) Along the lines of the previous point, there seem to be two separate points being made: a) there is a correspondence between Raman and proteins, and b) we can use the protein data to look at centrality, generality, SCGs, etc. And the two don't seem to be linked until the formalism of the Ωs?

      The reviewer is correct that we can calculate and analyze some of the quantities introduced in this study, such as stoichiometry conservation centrality and expression generality, without Raman data. However, it is difficult to justify introducing these quantities without analyzing the correspondence between the Raman and proteome profiles. Moreover, the definition of expression generality was derived from the analysis of Raman-proteome correspondence (see section 2.2 of the Supplementary Materials). Therefore, point b) cannot stand alone without point a) from its initial introduction.

      To improve readability and partially address the complicated structure of this manuscript, we added references to Fig. S1, which is a diagram of the paper’s mainline, to several places in the main text. Please also see our reply to the comment (7) by Reviewer #1 (Recommendations for the authors).

    1. eLife Assessment

      The authors analyzed spectral properties of neural activity recorded using laminar probes while mice engaged in a global/local visual oddball paradigm. They found solid evidence for an increase in gamma (and theta in some cases) for unpredictable versus predictable stimuli, and a reduction in alpha/beta, which they consider evidence towards a "predictive routing" scheme. The study is overall important because it addresses the basis of predictive processing in the cortex, but some of the analytical choices could be better motivated, and overall, the manuscript can be improved by performing additional analyses.

    2. Reviewer #1 (Public review):

      Summary:

      The authors recorded neural activity using laminar probes while mice engaged in a global/local visual oddball paradigm. The article focuses on oscillatory activity, and the authors found activity differences in theta, alpha/beta, and gamma bands related to predictability and prediction error.

      I think this is an important paper, providing more direct evidence for the role of signals in different frequency bands related to predictability and surprise in the sensory cortex.

      Comments:

      Below are some comments that may hopefully help further improve the quality of this already very interesting manuscript.

      (1) Introduction:

      The authors write in their introduction: "H1 further suggests a role for θ oscillations in prediction error processing as well." Without being fleshed out further, it is unclear what role this would be, or why. Could the authors expand this statement?

      (2) Limited propagation of gamma band signals:

      Some recent work (e.g. https://www.cell.com/cell-reports/fulltext/S2211-1247(23)00503-X) suggests that gamma-band signals reflect mainly entrainment of the fast-spiking interneurons, and don't propagate from V1 to downstream areas. Could the authors connect their findings to these emerging findings, suggesting no role in gamma-band activity in communication outside of the cortical column?

      (3) Paradigm:

      While I agree that the paradigm tests whether a specific type of temporal prediction can be formed, it is not a type of prediction that one would easily observe in mice, or even humans. The regularity that must be learned, in order to be able to see a reflection of predictability, integrates over 4 stimuli, each shown for 500 ms with a 500 ms blank in between (and a 1000 ms interval separating the 4th stimulus from the 1st stimulus of the next sequence). In other words, the mouse must keep in working memory three stimuli, which partly occurred more than a second ago, in order to correctly predict the fourth stimulus (and signal a 1000 ms interval as evidence for starting a new sequence).

      A problem with this paradigm is that positive findings are easier to interpret than negative findings. If mice do not show a modulation to the global oddball, is it because "predictive coding" is the wrong hypothesis, or simply because the authors generated a design that operates outside of the boundary conditions of the theory? I think the latter is more plausible. Even in more complex animals, (eg monkeys or humans), I suspect that participants would have trouble picking up this regularity and sequence, unless it is directly task-relevant (which it is not, in the current setting). Previous experiments often used simple pairs (where transitional probability was varied, eg, Meyer and Olson, PNAS 2012) of stimuli that were presented within an intervening blank period. Clearly, these regularities would be a lot simpler to learn than the highly complex and temporally spread-out regularity used here, facilitating the interpretation of negative findings (especially in early cortical areas, which are known to have relatively small temporal receptive fields).

      I am, of course, not asking the authors to redesign their study. I would like to ask them to discuss this caveat more clearly, in the Introduction and Discussion, and situate their design in the broader literature. For example, Jeff Gavornik has used much more rapid stimulus designs and observed clear modulations of spiking activity in early visual regions. I realize that this caveat may be more relevant for the spiking paper (which does not show any spiking activity modulation in V1 by global predictability) than for the current paper, but I still think it is an important general caveat to point out.

      (4) Reporting of results:

      I did not see any quantification of the strength of evidence of any of the results, beyond a general statement that all reported results pass significance at an alpha=0.01 threshold. It would be informative to know, for all reported results, what exactly the p-value of the significant cluster is; as well as for which performed tests there was no significant difference.

      (5) Cluster test:

      The authors use a three-dimensional cluster test, clustering across time, frequency, and location/channel. I am wondering how meaningful this analytical approach is. For example, there could be clusters that show an early difference at some location in low frequencies, and then a later difference in a different frequency band at another (adjacent) location. It seems a priori illogical to me to want to cluster across all these dimensions together, given that this kind of clustering appears neurophysiologically implausible/not meaningful. Can the authors motivate their choice of three-dimensional clustering, or better, to facilitate interpretability, cluster, e.g., across space and time within specific frequency bands (2D clustering)?

    3. Reviewer #2 (Public review):

      Summary:

      Sennesh and colleagues analyzed LFP data from 6 regions of rodents while they were habituated to a stimulus sequence containing a local oddball (xxxy) and later exposed to either the same (xxxY) or a deviant global oddball (xxxX). Subsequently, they were exposed to a controlled random sequence (XXXY) or a controlled deterministic sequence (xxxx or yyyy). From these, the authors looked for differences in spectral properties (both oscillatory and aperiodic) between three contrasts (only for the last stimulus of the sequence).

      (1) Deviance detection: unpredictable random (XXXY) versus predictable habituation (xxxy)

      (2) Global oddball: unpredictable global oddball (xxxX) versus predictable deterministic (xxxx), and

      (3) "Stimulus-specific adaptation:" locally unpredictable oddball (xxxY) versus predictable deterministic (yyyy).

      They found evidence for an increase in gamma (and theta in some cases) for unpredictable versus predictable stimuli, and a reduction in alpha/beta, which they consider evidence towards the "predictive routing" scheme.

      While the dataset and analyses are well-suited to test evidence for predictive coding versus alternative hypotheses, I felt that the formulation was ambiguous, and the results were not very clear. My major concerns are as follows:

      (1) The authors set up three competing hypotheses, in which H1 and H2 make directly opposite predictions. However, it must be noted that H2 is proposed for spatial prediction, where the predictability is computed from the part of the image outside the RF. This is different from the temporal prediction that is tested here. Evidence in favor of H2 is readily observed when large gratings are presented, for which there is substantially more gamma than in small images. Actually, there are multiple features in the spectral domain that should not be conflated, namely (i) the transient broadband response, which includes all frequencies, (ii) contribution from the evoked response (ERP), which is often in frequencies below 30 Hz, (iii) narrow-band gamma oscillations which are produced by large and continuous stimuli (which happen to be highly predictive), and (iv) sustained low-frequency rhythms in theta and alpha/beta bands which are prominent before stimulus onset and reduce after ~200 ms of stimulus onset. The authors should be careful to incorporate these in their formulation of PC, and in particular should not conflate narrow-band and broadband gamma.

      (2) My understanding is that any aspect of predictive coding must be present before the onset of stimulus (expected or unexpected). So, I was surprised to see that the authors have shown the results only after stimulus onset. For all figures, the authors should show results from -500 ms to 500 ms instead of zero to 500 ms.

      (3) In many cases, some change is observed in the initial ~100 ms of stimulus onset, especially for the alpha/beta and theta ranges. However, the evoked response contributes substantially in the transient period in these frequencies, and this evoked response could be different for different conditions. The authors should show the evoked responses to confirm the same, and if the claim really is that predictions are carried by genuine "oscillatory" activity, show the results after removing the ERP (as they had done for the CSD analysis).

      (4) I was surprised by the statistics used in the plots. Anything that is even slightly positive or negative is turning out to be significant. Perhaps the authors could use a more stringent criterion for multiple comparisons?

      (5) Since the design is blocked, there might be changes in global arousal levels. This is particularly important because the more predictive stimuli in the controlled deterministic stimuli were presented towards the end of the session, when the animal is likely less motivated. One idea to check for this is to do the analysis on the 3rd stimulus instead of the 4th? Any general effect of arousal/attention will be reflected in this stimulus.

      (6) The authors should also acknowledge/discuss that typical stimulus presentation/attention modulation involves both (i) an increase in broadband power early on and (ii) a reduction in low-frequency alpha/beta power. This could be just a sensory response, without having a role in sending prediction signals per se. So the predictive routing hypothesis should involve testing for signatures of prediction while ruling out other confounds related to stimulus/cognition. It is, of course, very difficult to do so, but at the same time, simply showing a reduction in low-frequency power coupled with an increase in high-frequency power is not sufficient to prove PR.

      (7) The CSD results need to be explained better - you should explain on what basis they are being called feedforward/feedback. Was LFP taken from Layer 4 LFP (as was done by van Kerkoerle et al, 2014)? The nice ">" and "<" CSD patterns (Figure 3B and 3F of their paper) in that paper are barely observed in this case, especially for the alpha/beta range.

      (8) Figure 4a-c, I don't see a reduction in the broadband signal in a compared to b in the initial segment. Maybe change the clim to make this clearer?

      (9) Figure 5 - please show the same for all three frequency ranges, show all bars (including the non-significant ones), and indicate the significance (p-values or by *, **, ***, etc) as done usually for bar plots.

      (10) Their claim of alpha/beta oscillations being suppressed for unpredictable conditions is not as evident. A figure akin to Figure 5 would be helpful to see if this assertion holds.

      (11) To investigate the prediction and violation or confirmation of expectation, it would help to look at both the baseline and stimulus periods in the analyses.

    4. Reviewer #3 (Public review):

      Summary:

      In their manuscript entitled "Ubiquitous predictive processing in the spectral domain of sensory cortex", Sennesh and colleagues perform spectral analysis across multiple layers and areas in the visual system of mice. Their results are timely and interesting as they provide a complement to a study from the same lab focussed on firing rates, instead of oscillations. Together, the present study argues for a hypothesis called predictive routing, which argues that non-predictable stimuli are gated by Gamma oscillations, while alpha/beta oscillations are related to predictions.

      Strengths:

      (1) The study contains a clear introduction, which provides a clear contrast between a number of relevant theories in the field, including their hypotheses in relation to the present data set.

      (2) The study provides a systematic analysis across multiple areas and layers of the visual cortex.

      Weaknesses:

      (1) It is claimed in the abstract that the present study supports predictive routing over predictive coding; however, this claim is nowhere in the manuscript directly substantiated. Not even the differences are clearly laid out, much less tested explicitly. While this might be obvious to the authors, it remains completely opaque to the reader, e.g., as it is also not part of the different hypotheses addressed. I guess this result is meant in contrast to reference 17, by some of the same authors, which argues against predictive coding, while the present work finds differences in the results, which they relate to spectral vs firing rate analysis (although without direct comparison).

      (2) Most of the claims about a direction of propagation of certain frequency-related activities (made in the context of Figures 2-4) are - to the eyes of the reviewer - not supported by actual analysis but glimpsed from the pictures, sometimes, with very little evidence/very small time differences to go on. To keep these claims, proper statistical testing should be performed.

      (3) Results from different areas are barely presented. While I can see that presenting them in the same format as Figures 2-4 would be quite lengthy, it might be a good idea to contrast the right columns (difference plots) across areas, rather than just the overall averages.

      (4) Statistical testing is treated very generally, which can help to improve the readability of the text; however, in the present case, this is a bit extreme, with even obvious tests not reported or not even performed (in particular in Figure 5).

      (5) The description of the analysis in the methods is rather short and, to my eye, was missing one of the key descriptions, i.e., how the CSD plots were baselined (which was hinted at in the results, but, as far as I know, not clearly described in the analysis methods). Maybe the authors could section the methods more to point out where this is discussed.

      (6) While I appreciate the efforts of the authors to formulate their hypotheses and test them clearly, the text is quite dense at times. Partly this is due to the compared conditions in this paradigm; however, it would help a lot to show a visualization of what is being compared in Figures 2-4, rather than just showing the results.

    5. Author response:

      We would like to thank the three Reviewers for their thoughtful comments and detailed feedback. We are pleased to hear that the Reviewers found our paper to be “providing more direct evidence for the role of signals in different frequency bands related to predictability and surprise” (R1), “well-suited to test evidence for predictive coding versus alternative hypotheses” (R2), and “timely and interesting” (R3).

      We perceive that the reviewers have an overall positive impression of the experiments and analyses, but find the text somewhat dense and would like to see additional statistical rigor, as well as, in some cases, additional analyses to be included in the supplementary material. We therefore provide here a provisional letter that addresses the revisions we have already performed and outlines, point by point, the revisions we are planning. Each enumerated point begins with the Reviewer’s quoted text, followed by our response.

      Reviewer 1:

      (1) Introduction:

      The authors write in their introduction: "H1 further suggests a role for θ oscillations in prediction error processing as well." Without being fleshed out further, it is unclear what role this would be, or why. Could the authors expand this statement?

      We have edited the text to indicate that theta-band activity has been related to prediction error processing as an empirical observation, and must regrettably leave drawing inferences about its functional role to future work, with experiments designed specifically to draw out theta-band activity.

      (2) Limited propagation of gamma band signals:

      Some recent work (e.g. https://www.cell.com/cell-reports/fulltext/S2211-1247(23)00503-X) suggests that gamma-band signals reflect mainly entrainment of the fast-spiking interneurons, and don't propagate from V1 to downstream areas. Could the authors connect their findings to these emerging findings, suggesting no role in gamma-band activity in communication outside of the cortical column?

      We have not specifically claimed that gamma propagates between columns/areas in our recordings, only that it synchronizes synaptic current flows between laminar layers within a column/area. We nonetheless suggest that gamma can locally synchronize a column, and potentially local columns within an area via entrainment of local recurrent spiking, to update an internal prediction/representation upon onset of a prediction error. We also point the Reviewer to our Discussion section, where we state that our results fit with a model “whereby θ oscillations synchronize distant areas, enabling them to exchange relevant signals during cognitive processing.” In our present work, we therefore remain agnostic about whether theta or gamma or both (or alternative mechanisms) are at play in terms of how prediction error signals are transmitted between areas.

      (3) Paradigm:

      While I agree that the paradigm tests whether a specific type of temporal prediction can be formed, it is not a type of prediction that one would easily observe in mice, or even humans. The regularity that must be learned, in order to be able to see a reflection of predictability, integrates over 4 stimuli, each shown for 500 ms with a 500 ms blank in between (and a 1000 ms interval separating the 4th stimulus from the 1st stimulus of the next sequence). In other words, the mouse must keep in working memory three stimuli, which partly occurred more than a second ago, in order to correctly predict the fourth stimulus (and signal a 1000 ms interval as evidence for starting a new sequence).

      A problem with this paradigm is that positive findings are easier to interpret than negative findings. If mice do not show a modulation to the global oddball, is it because "predictive coding" is the wrong hypothesis, or simply because the authors generated a design that operates outside of the boundary conditions of the theory? I think the latter is more plausible. Even in more complex animals, (eg monkeys or humans), I suspect that participants would have trouble picking up this regularity and sequence, unless it is directly task-relevant (which it is not, in the current setting). Previous experiments often used simple pairs (where transitional probability was varied, eg, Meyer and Olson, PNAS 2012) of stimuli that were presented within an intervening blank period. Clearly, these regularities would be a lot simpler to learn than the highly complex and temporally spread-out regularity used here, facilitating the interpretation of negative findings (especially in early cortical areas, which are known to have relatively small temporal receptive fields).

      I am, of course, not asking the authors to redesign their study. I would like to ask them to discuss this caveat more clearly, in the Introduction and Discussion, and situate their design in the broader literature. For example, Jeff Gavornik has used much more rapid stimulus designs and observed clear modulations of spiking activity in early visual regions. I realize that this caveat may be more relevant for the spiking paper (which does not show any spiking activity modulation in V1 by global predictability) than for the current paper, but I still think it is an important general caveat to point out.

      We appreciate the Reviewer’s concern about working memory limitations in mice. Our paradigm and training followed on from previous paradigms such as Gavornik and Bear (2014), in which predictive effects were observed in mouse V1 with presentation times of 150ms and interstimulus intervals of 1500ms. In addition, we note that Jamali et al. (2024) recently utilized a similar global/local paradigm in the auditory domain with inter-sequence intervals as long as 28-30 seconds, and still observed effects of a predicted sequence (https://elifesciences.org/articles/102702). For the revised manuscript, we plan to expand on this in the Discussion section.

      That being said, as the Reviewer also pointed out, this would be a greater concern had we not found any positive findings in our study. However, even with the rather long sequence periods we used, we did find positive evidence for predictive effects, supporting the use of our current paradigm. We agree with the reviewer that these positive effects are easier to interpret than negative effects, and plan to expand upon this in the Discussion when we resubmit.

      (4) Reporting of results:

      I did not see any quantification of the strength of evidence of any of the results, beyond a general statement that all reported results pass significance at an alpha=0.01 threshold. It would be informative to know, for all reported results, what exactly the p-value of the significant cluster is; as well as for which performed tests there was no significant difference.

      For the revised manuscript, we can include the p-values after cluster-based testing for each significant cluster, as well as show data that passes a more stringent threshold of p<0.001 (1/1000) or p<0.005 (1/200) rather than our present p<0.01 (1/100).
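
      As an illustration of how a per-cluster p-value can be obtained, the sketch below sums statistics over contiguous supra-threshold runs along one dimension and compares an observed cluster mass against a null distribution of maximum cluster masses from permuted data. This is a generic max-statistic scheme, not the authors' code; the function names, the threshold, and the example values are hypothetical.

```python
def cluster_masses(stats, threshold):
    """Sum the statistics over each contiguous run of values above threshold."""
    masses, current, in_cluster = [], 0.0, False
    for s in stats:
        if s > threshold:
            current += s
            in_cluster = True
        elif in_cluster:
            masses.append(current)
            current, in_cluster = 0.0, False
    if in_cluster:  # close a cluster that runs to the end of the array
        masses.append(current)
    return masses


def cluster_p_value(observed_mass, null_max_masses):
    """Permutation p-value: share of null maxima at least as large as observed.

    The +1 terms count the observed data as one of the permutations,
    so the p-value can never be exactly zero.
    """
    count = sum(1 for m in null_max_masses if m >= observed_mass)
    return (count + 1) / (len(null_max_masses) + 1)


# Hypothetical t-values along one dimension, thresholded at 2.
observed = cluster_masses([0, 3, 4, 0, 5, 0], threshold=2)
# Hypothetical maximum cluster masses from four permutations.
p_first = cluster_p_value(observed[0], [1, 2, 8, 3])
```

      With only four permutations the smallest attainable p-value is 1/5, so reporting cluster p-values at the 0.001 level would need at least 999 permutations under this scheme.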

      (5) Cluster test:

      The authors use a three-dimensional cluster test, clustering across time, frequency, and location/channel. I am wondering how meaningful this analytical approach is. For example, there could be clusters that show an early difference at some location in low frequencies, and then a later difference in a different frequency band at another (adjacent) location. It seems a priori illogical to me to want to cluster across all these dimensions together, given that this kind of clustering appears neurophysiologically implausible/not meaningful. Can the authors motivate their choice of three-dimensional clustering, or better, to facilitate interpretability, cluster e.g. across space and time within specific frequency bands (2D clustering)?

      We are happy to include a 3D plot of a time-channel-frequency cluster in the revised manuscript to clarify our statistical approach for the reviewer. We consider our current three-dimensional cluster-testing an “unsupervised” way of uncovering significant contrasts with no theory-driven assumptions about which bounded frequency bands or layers do what.
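
      To make the notion of a time-channel-frequency cluster concrete, the following sketch labels connected supra-threshold points in a three-dimensional boolean mask using face adjacency (6-connectivity). It is an assumption-laden illustration rather than the authors' implementation: the mask layout `[time][channel][frequency]`, the connectivity rule, and the function name are all hypothetical choices.

```python
from collections import deque


def label_clusters(mask):
    """Label face-connected True entries of a [time][channel][freq] mask.

    Returns (labels, n_clusters); labels has the same shape as mask,
    with 0 for background and 1..n_clusters for cluster membership.
    """
    nt, nc, nf = len(mask), len(mask[0]), len(mask[0][0])
    labels = [[[0] * nf for _ in range(nc)] for _ in range(nt)]
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    n_clusters = 0
    for t in range(nt):
        for c in range(nc):
            for f in range(nf):
                if not mask[t][c][f] or labels[t][c][f]:
                    continue
                # Start a new cluster and flood-fill it breadth-first.
                n_clusters += 1
                labels[t][c][f] = n_clusters
                queue = deque([(t, c, f)])
                while queue:
                    a, b, d = queue.popleft()
                    for da, db, dd in steps:
                        x, y, z = a + da, b + db, d + dd
                        if (0 <= x < nt and 0 <= y < nc and 0 <= z < nf
                                and mask[x][y][z] and not labels[x][y][z]):
                            labels[x][y][z] = n_clusters
                            queue.append((x, y, z))
    return labels, n_clusters


# Toy 2 x 2 x 2 mask: two adjacent frequency bins at (time 0, channel 0),
# plus one isolated point at (time 1, channel 1, frequency 0).
mask = [
    [[True, True], [False, False]],
    [[False, False], [True, False]],
]
labels, n_clusters = label_clusters(mask)
```

      Under this rule a single cluster can extend across neighboring frequency bins and channels, which is the property Reviewer 1 asks about; restricting `steps` to the time and channel axes would give the band-wise 2D clustering the reviewer suggests.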

      Reviewer 2:

      Sennesh and colleagues analyzed LFP data from 6 regions of rodents while they were habituated to a stimulus sequence containing a local oddball (xxxy) and later exposed to either the same (xxxY) or a deviant global oddball (xxxX). Subsequently, they were exposed to a controlled random sequence (XXXY) or a controlled deterministic sequence (xxxx or yyyy). From these, the authors looked for differences in spectral properties (both oscillatory and aperiodic) between three contrasts (only for the last stimulus of the sequence).

      (1) Deviance detection: unpredictable random (XXXY) versus predictable habituation (xxxy)

      (2) Global oddball: unpredictable global oddball (xxxX) versus predictable deterministic (xxxx), and

      (3) "Stimulus-specific adaptation:" locally unpredictable oddball (xxxY) versus predictable deterministic (yyyy).

      They found evidence for an increase in gamma (and theta in some cases) for unpredictable versus predictable stimuli, and a reduction in alpha/beta, which they consider evidence towards the "predictive routing" scheme.

      While the dataset and analyses are well-suited to test evidence for predictive coding versus alternative hypotheses, I felt that the formulation was ambiguous, and the results were not very clear. My major concerns are as follows:

      We appreciate the reviewer’s concerns and outline how we will address them below:

      (1) The authors set up three competing hypotheses, in which H1 and H2 make directly opposite predictions. However, it must be noted that H2 is proposed for spatial prediction, where the predictability is computed from the part of the image outside the RF. This is different from the temporal prediction that is tested here. Evidence in favor of H2 is readily observed when large gratings are presented, for which there is substantially more gamma than in small images. Actually, there are multiple features in the spectral domain that should not be conflated, namely (i) the transient broadband response, which includes all frequencies, (ii) contribution from the evoked response (ERP), which is often in frequencies below 30 Hz, (iii) narrow-band gamma oscillations which are produced by large and continuous stimuli (which happen to be highly predictive), and (iv) sustained low-frequency rhythms in theta and alpha/beta bands which are prominent before stimulus onset and reduce after ~200 ms of stimulus onset. The authors should be careful to incorporate these in their formulation of PC, and in particular should not conflate narrow-band and broadband gamma.

      We have clarified in the manuscript that while the gamma-as-prediction hypothesis (our H2) was originally proposed in a spatial prediction domain, further work (specifically Singer (2021)) has extended the hypothesis to cover temporal-domain predictions as well.

      To address the reviewer’s point about multiple features in the spectral domain: Our analysis has specifically separated aperiodic components using FOOOF analysis (Supp. Fig. 1) and explicitly fit and tested aperiodic vs. periodic components (Supp. Figs 1&2). We did not find strong effects in the aperiodic components but did in the periodic components (Supp. Fig. 2), allowing us to be more confident in our conclusions in terms of genuine narrow-band oscillations. In the revised manuscript, we will include analysis of the pre-stimulus time window to address the reviewer’s point (iv) on sustained low frequency oscillations.
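
      The separation of aperiodic from periodic spectral components can be illustrated with a deliberately simplified sketch: fit a straight line to the power spectrum in log-log space (the 1/f-like aperiodic part) and treat the residual as the candidate periodic part. This is not the FOOOF algorithm itself, which fits peaks and the aperiodic component iteratively; the function names and the synthetic spectrum below are assumptions made for illustration.

```python
import math


def fit_aperiodic(freqs, power):
    """Least-squares fit of log10(P) = offset - exponent * log10(f)."""
    xs = [math.log10(f) for f in freqs]
    ys = [math.log10(p) for p in power]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    offset = mean_y - slope * mean_x
    return offset, -slope  # exponent is reported as a positive number


def periodic_residual(freqs, power, offset, exponent):
    """Log-power left over after subtracting the fitted aperiodic line."""
    return [math.log10(p) - (offset - exponent * math.log10(f))
            for f, p in zip(freqs, power)]


# Synthetic spectrum: P = 100 / f^1.5 with a narrow-band bump at 40 Hz.
freqs = list(range(1, 51))
power = [100.0 / f ** 1.5 for f in freqs]
power[39] *= 10.0
offset, exponent = fit_aperiodic(freqs, power)
residual = periodic_residual(freqs, power, offset, exponent)
peak_freq = freqs[residual.index(max(residual))]
```

      Because the bump is included in the line fit, the estimated exponent is slightly biased here; iterative peak removal, as done by dedicated tools, reduces that bias.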

      (2) My understanding is that any aspect of predictive coding must be present before the onset of stimulus (expected or unexpected). So, I was surprised to see that the authors have shown the results only after stimulus onset. For all figures, the authors should show results from -500 ms to 500 ms instead of zero to 500 ms.

      In our revised manuscript we will include a pre-stimulus analysis and supplementary figures with time ranges from -500ms to 500ms. We refrained from doing so in the initial manuscript only because our paradigm’s short interstimulus interval makes it difficult to interpret whether activity in the ISI reflects post-stimulus dynamics or pre-stimulus prediction. Nonetheless, we can easily show that in our paradigm, alpha/beta-band activity is elevated in the interstimulus interval after the offset of the previous stimulus, assuming that we baseline to the pre-trial period.

      (3) In many cases, some change is observed in the initial ~100 ms of stimulus onset, especially for the alpha/beta and theta ranges. However, the evoked response contributes substantially in the transient period in these frequencies, and this evoked response could be different for different conditions. The authors should show the evoked responses to confirm the same, and if the claim really is that predictions are carried by genuine "oscillatory" activity, show the results after removing the ERP (as they had done for the CSD analysis).

      We have included an extra sentence in our Materials and Methods section clarifying that the evoked potential/ERP was removed in our existing analyses, prior to performing the spectral decomposition of the LFP signal. We also note that the FOOOF analysis we applied separates aperiodic components of the spectral signal from the strictly oscillatory ones.

      In our revised manuscript we will include an analysis of the evoked responses as suggested by the reviewer.
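
      Removing the evoked potential before spectral decomposition is commonly done by subtracting the trial-averaged waveform from every trial, leaving the non-phase-locked ("induced") activity. The sketch below shows that step under the assumption of equal-length trials for a single channel and condition; the function name and data are hypothetical, and the authors' exact procedure may differ.

```python
def remove_evoked(trials):
    """Subtract the across-trial mean (ERP) from each trial.

    trials: list of equal-length lists of samples, one list per trial.
    Returns (erp, induced), where induced has the same shape as trials.
    """
    n_trials = len(trials)
    n_samples = len(trials[0])
    erp = [sum(trial[t] for trial in trials) / n_trials for t in range(n_samples)]
    induced = [[trial[t] - erp[t] for t in range(n_samples)] for trial in trials]
    return erp, induced


# Two toy trials of three samples each.
erp, induced = remove_evoked([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])
```

      Computing the ERP separately for each condition matters for the reviewer's concern, since a condition-specific evoked response would otherwise leak into the low-frequency contrasts.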

      (4) I was surprised by the statistics used in the plots. Anything that is even slightly positive or negative is turning out to be significant. Perhaps the authors could use a more stringent criterion for multiple comparisons?

      As noted above to Reviewer 1 (point 4), we are happy to include supplemental figures in our resubmission showing the effects on our results of setting the statistical significance threshold with considerably greater stringency.

      (5) Since the design is blocked, there might be changes in global arousal levels. This is particularly important because the more predictive stimuli in the controlled deterministic stimuli were presented towards the end of the session, when the animal is likely less motivated. One idea to check for this is to do the analysis on the 3rd stimulus instead of the 4th? Any general effect of arousal/attention will be reflected in this stimulus.

      In order to check for the brain-wide effects of arousal, we plan to perform similar analyses to our existing ones on the 3rd stimulus in each block, rather than just the 4th “oddball” stimulus. Clusters that show significant contrasts for both the 3rd and 4th stimuli may be attributable to arousal. We will also analyze pupil size as an index of arousal to check for arousal differences between conditions in our contrasts, possibly stratifying our data before performing comparisons to equalize pupil size within contrasts. We plan to include these analyses in our resubmission.
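
      One way to stratify trials so that pupil size is matched between two conditions is to bin the pupil measurements and randomly subsample each bin down to the smaller of the two condition counts. The sketch below assumes one pupil value per trial; the bin edges, the seed, and the function name are hypothetical, and other matching schemes are equally plausible.

```python
import bisect
import random


def stratify_by_bin(values_a, values_b, bin_edges, seed=0):
    """Return trial indices for A and B with equal counts in every pupil bin."""
    rng = random.Random(seed)

    def indices_per_bin(values):
        bins = {}
        for i, v in enumerate(values):
            # Bin 0 lies below the first edge, bin k at or above edge k-1.
            bins.setdefault(bisect.bisect_right(bin_edges, v), []).append(i)
        return bins

    bins_a, bins_b = indices_per_bin(values_a), indices_per_bin(values_b)
    keep_a, keep_b = [], []
    # Bins that are empty in either condition contribute no trials.
    for b in sorted(set(bins_a) & set(bins_b)):
        n = min(len(bins_a[b]), len(bins_b[b]))
        keep_a += sorted(rng.sample(bins_a[b], n))
        keep_b += sorted(rng.sample(bins_b[b], n))
    return keep_a, keep_b


# Toy pupil sizes (arbitrary units) for two conditions, split at 0.5.
pupil_a = [0.1, 0.2, 0.6, 0.7, 0.8]
pupil_b = [0.15, 0.9]
keep_a, keep_b = stratify_by_bin(pupil_a, pupil_b, bin_edges=[0.5])
```

      Stratification discards trials, so the cost is reduced statistical power in the condition with more data.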

      (6) The authors should also acknowledge/discuss that typical stimulus presentation/attention modulation involves both (i) an increase in broadband power early on and (ii) a reduction in low-frequency alpha/beta power. This could be just a sensory response, without having a role in sending prediction signals per se. So the predictive routing hypothesis should involve testing for signatures of prediction while ruling out other confounds related to stimulus/cognition. It is, of course, very difficult to do so, but at the same time, simply showing a reduction in low-frequency power coupled with an increase in high-frequency power is not sufficient to prove PR.

      Since different predictive coding and predictive processing hypotheses make very different claims about how predictions might be encoded in neurophysiological recordings, we have focused on prediction error encoding in this paper.

      For the hypothesis space we have considered (H1-H3), each hypothesis makes clearly distinguishable predictions about the spectral response during the time period in the task when prediction errors should be present. As noted by the reviewer, a transient increase in broadband frequencies would be a signature of H3. Changes to oscillatory power in the gamma band in distinct directions (e.g., increasing or decreasing with prediction error) would support either H1 or H2, depending on the direction of change. We believe our data, especially our use of FOOOF analysis and separation of periodic from aperiodic components, coupled with the three experimental contrasts, speak clearly in favor of the Predictive Routing model, but we do not claim we have “proved” it. This study provides just one datapoint, and we will acknowledge this in the Discussion of our resubmission.

      (7) The CSD results need to be explained better - you should explain on what basis they are being called feedforward/feedback. Was LFP taken from Layer 4 LFP (as was done by van Kerkoerle et al, 2014)? The nice ">" and "<" CSD patterns (Figure 3B and 3F of their paper) in that paper are barely observed in this case, especially for the alpha/beta range.

      We consider a feedforward pattern as flowing from L4 outwards to L2/3 and L5/6, and a feedback pattern as flowing in the opposite direction, from L1 and L6 to the middle layers. We will clarify this in the revised manuscript.

      Since gamma-band oscillations are strongest in L2/3, we re-epoched LFPs to the oscillation troughs in L2/3 in the initial manuscript. We can include in the revised manuscript equivalent plots after finding oscillation troughs in L4 instead, as well as calculating the difference in trough times within-band between layers to quantify the transmission delay and add additional rigor to our feedforward vs. feedback interpretation of the CSD data.
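
      The proposed quantification of transmission delay from trough times can be sketched as follows: detect troughs (local minima) in a band-limited signal from a reference layer, pair each with the nearest trough in another layer, and average the time differences. The trough detector, the nearest-trough pairing, and the 1000 Hz sampling rate are illustrative assumptions, not details taken from the manuscript.

```python
import math


def find_troughs(signal):
    """Indices of local minima, excluding the first and last samples."""
    return [i for i in range(1, len(signal) - 1)
            if signal[i] < signal[i - 1] and signal[i] <= signal[i + 1]]


def mean_trough_lag_ms(reference, other, fs_hz):
    """Mean lag of the nearest trough in `other` relative to `reference`.

    Positive values mean that troughs in `other` occur later.
    Both signals must contain at least one trough.
    """
    ref_troughs = find_troughs(reference)
    other_troughs = find_troughs(other)
    lags = [min(other_troughs, key=lambda o: abs(o - r)) - r for r in ref_troughs]
    return 1000.0 * sum(lags) / (len(lags) * fs_hz)


# Toy 50 Hz oscillation sampled at 1000 Hz; the second layer lags by 2 samples.
n_samples, period = 60, 20
layer_ref = [math.cos(2 * math.pi * i / period) for i in range(n_samples)]
layer_other = [math.cos(2 * math.pi * (i - 2) / period) for i in range(n_samples)]
lag_ms = mean_trough_lag_ms(layer_ref, layer_other, fs_hz=1000)
```

      Nearest-trough pairing becomes ambiguous once the true delay approaches half an oscillation cycle, so the approach is best suited to delays that are short relative to the period of the band.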

      (8) Figure 4a-c, I don't see a reduction in the broadband signal in a compared to b in the initial segment. Maybe change the clim to make this clearer?

      We are looking into the clim/colorbar and plot-generation code to figure out the visibility issue that the Reviewer has kindly pointed out to us.

      (9) Figure 5 - please show the same for all three frequency ranges, show all bars (including the non-significant ones), and indicate the significance (p-values or by *, **, ***, etc) as done usually for bar plots.

      We will add the requested bar-plots for all frequency ranges, though we note that the bars given here are the results of adding up the spectral power in the channel-time-frequency clusters that already passed significance tests and that adding secondary significance tests here may not prove informative.

      (10) Their claim of alpha/beta oscillations being suppressed for unpredictable conditions is not as evident. A figure akin to Figure 5 would be helpful to see if this assertion holds.

      As noted above, we will include the requested bar plot, and we will examine alpha/beta in the pre-stimulus time series rather than after the onset of the oddball stimulus.

      (11) To investigate the prediction and violation or confirmation of expectation, it would help to look at both the baseline and stimulus periods in the analyses.

      We will include for the Reviewer’s edification a supplementary figure showing the spectrograms for the baseline and full-trial periods to look at the difference between baseline and prestimulus expectation.

      Reviewer 3:

      Summary:

      In their manuscript entitled "Ubiquitous predictive processing in the spectral domain of sensory cortex", Sennesh and colleagues perform spectral analysis across multiple layers and areas in the visual system of mice. Their results are timely and interesting as they provide a complement to a study from the same lab focussed on firing rates, instead of oscillations. Together, the present study argues for a hypothesis called predictive routing, which argues that non-predictable stimuli are gated by Gamma oscillations, while alpha/beta oscillations are related to predictions.

      Strengths:

      (1) The study contains a clear introduction, which provides a clear contrast between a number of relevant theories in the field, including their hypotheses in relation to the present data set.

      (2) The study provides a systematic analysis across multiple areas and layers of the visual cortex.

      We thank the Reviewer for their kind comments.

      Weaknesses:

      (1) It is claimed in the abstract that the present study supports predictive routing over predictive coding; however, this claim is nowhere in the manuscript directly substantiated. Not even the differences are clearly laid out, much less tested explicitly. While this might be obvious to the authors, it remains completely opaque to the reader, e.g., as it is also not part of the different hypotheses addressed. I guess this result is meant in contrast to reference 17, by some of the same authors, which argues against predictive coding, while the present work finds differences in the results, which they relate to spectral vs firing rate analysis (although without direct comparison).

      We agree that in this manuscript we should restrict ourselves to the hypotheses that were directly tested. We have revised our abstract accordingly, and softened our claim to note only that our LFP results are compatible with predictive routing.

      (2) Most of the claims about a direction of propagation of certain frequency-related activities (made in the context of Figures 2-4) are - to the eyes of the reviewer - not supported by actual analysis but glimpsed from the pictures, sometimes, with very little evidence/very small time differences to go on. To keep these claims, proper statistical testing should be performed.

      In our revised manuscript, we will either substantiate (with quantification of CSD delays between layers) or soften the claims about feedforward/feedback direction of flow within the cortical column.

      (3) Results from different areas are barely presented. While I can see that presenting them in the same format as Figures 2-4 would be quite lengthy, it might be a good idea to contrast the right columns (difference plots) across areas, rather than just the overall averages.

      In our revised manuscript we will gladly include a supplementary figure showing the right-column difference plots across areas, in order to make sure to include aspects of our dataset that span up and down the cortical hierarchy.

      (4) Statistical testing is treated very generally, which can help to improve the readability of the text; however, in the present case, this is a bit extreme, with even obvious tests not reported or not even performed (in particular in Figure 5).

      We appreciate the Reviewer’s concern for statistical rigor, and as noted to the other reviewers, we can add different levels of statistical description and describe the p-values associated with specific clusters. Regarding Figure 5, we must protest, as the bar heights were computed from clusters already subjected to statistical testing and found significant. We could add a supplementary figure which considers untested narrowband activity and tests it only in the “bar height” domain, if the Reviewer would like.

      (5) The description of the analysis in the methods is rather short and, to my eye, was missing one of the key descriptions, i.e., how the CSD plots were baselined (which was hinted at in the results, but, as far as I know, not clearly described in the analysis methods). Maybe the authors could section the methods more to point out where this is discussed.

      We have added some elaboration to our Materials and Methods section, especially to specify that CSD, having physical rather than arbitrary units, does not require baselining.

      (6) While I appreciate the efforts of the authors to formulate their hypotheses and test them clearly, the text is quite dense at times. Partly this is due to the compared conditions in this paradigm; however, it would help a lot to show a visualization of what is being compared in Figures 2-4, rather than just showing the results.

      In the revised manuscript we will add a visual aid for the three contrasts we consider.

      We are happy to inform the editors that we have implemented, for the Reviewed Preprint, the direct textual Recommendations for the Authors given by Reviewers 2 and 3. We will implement the suggested Figure changes in our revised manuscript. We thank them for their feedback in strengthening our manuscript.

    1. eLife Assessment

      Mark and colleagues developed and validated a valuable method for examining subspace generalization in fMRI data and applied it to understand whether the entorhinal cortex uses abstract representations that generalize across different environments with the same structure. The manuscript presents convincing evidence for the conclusion that abstract entorhinal representations of hexagonal associative structures generalize across different stimulus sets.

    2. Reviewer #1 (Public review):

      Summary:

      This study develops and validates a neural subspace similarity analysis for testing whether neural representations of graph structures generalize across graph size and stimulus sets. The authors show the method works in rat grid and place cell data, finding that grid but not place cells generalize across different environments, as expected. The authors then perform additional analyses and simulations to show that this method should also work on fMRI data. Finally, the authors test their method on fMRI responses from entorhinal cortex (EC) in a task that involves graphs that vary in size (and stimulus set) and statistical structure (hexagonal and community). They find neural representations of stimulus sets in lateral occipital complex (LOC) generalize across statistical structure and that EC activity generalizes across stimulus sets/graph size, but only for the hexagonal structures.

      Strengths:

      (1) The overall topic is very interesting and timely and the manuscript is well written.

      (2) The method is clever and powerful. It could be important for future research testing whether neural representations are aligned across problems with different state manifestations.

      (3) The findings provide new insights into generalizable neural representations of abstract task states in entorhinal cortex.

      Weaknesses:

      (1) There are two design confounds that are not sufficiently discussed.

      (1.1) First, hexagonal and community structures are confounded by training order. All subjects learned the hexagonal graph always before the community graph. As such, any differences between the two graphs could be explained (in theory) by order effects (although this is unlikely). However, because community and hexagonal structures shared the same stimuli, it is possible that subjects had to find ways to represent the community structures separately from the hexagonal structures. This could potentially explain why there was no generalization across graph size for community structures.

      (1.2) Second, subjects had more experience with the hexagonal and community structures before and during fMRI scanning. This is another possible reason why there was no generalization for the community structure.

      (2) The authors include the results from a searchlight analysis to show specificity of the effects for EC. A more convincing way (in my opinion) to show specificity would be to test for (and report the results of) a double dissociation between the visual and structural contrast in two independently defined regions (e.g., anatomical ROIs of LOC and EC). This would substantiate the point that EC activity generalizes across structural similarity while sensory regions like LOC generalize across visual similarity.

    3. Reviewer #2 (Public review):

      Summary:

      Mark and colleagues test the hypothesis that entorhinal cortical representations may contain abstract structural information that facilitates generalization across structurally similar contexts. To do so, they use a method called "subspace generalization" designed to measure abstraction of representations across different settings. The authors validate the method using hippocampal place cells and entorhinal grid cells recorded in a spatial task, then perform simulations showing that it might also be useful for aggregated responses such as those measured with fMRI. The method is then applied to an fMRI dataset from a task that required participants to learn relationships between images in one of two structural motifs (hexagonal grids versus community structure). They show that the BOLD signal within an entorhinal ROI shows increased measures of subspace generalization across different tasks with the same hexagonal structure (as compared to tasks with different structures) but that there was no evidence for the complementary result (i.e., increased generalization across tasks that share community structure, as compared to those with different structures). Taken together, this manuscript describes and validates a method for identifying fMRI representations that generalize across conditions and applies it to reveal entorhinal representations that emerge across specific shared structural conditions.

      Strengths:

      I found this paper interesting both in terms of its methods and its motivating questions. The question asked is novel and the methods employed are new - and I believe this is the first time that they have been applied to fMRI data. I also found the iterative validation of the methodology to be interesting and important - showing persuasively that the method could detect a target representation - even in the face of random combination of tuning and with the addition of noise, both being major hurdles to investigating representations using fMRI.

      Weaknesses:

      The primary weakness of the paper in terms of empirical results is that the representations identified in EC had no clear relationship to behavior, raising questions about their functional importance.

      The method developed is a clearly valuable tool that can serve as part of a larger battery of analysis techniques, but a small weakness on the methodological side is that for a given dataset, it might be hard to determine whether the method developed here would be better or worse than alternative methods.

    4. Reviewer #3 (Public review):

      Summary:

      The article explores the brain's ability to generalize information, with a specific focus on the entorhinal cortex (EC) and its role in learning and representing structural regularities that define relationships between entities in networks. The research provides empirical support for the longstanding theoretical and computational neuroscience hypothesis that the EC is crucial for structure generalization. It demonstrates that EC codes can generalize across non-spatial tasks that share common structural regularities, regardless of the similarity of sensory stimuli and network size.

      Strengths:

      At first glance, a potential limitation of this study appears to be its application of analytical methods originally developed for high-resolution animal electrophysiology (Samborska et al., 2022) to the relatively coarse and noisy signals of human fMRI. Rather than sidestepping this issue, however, the authors embrace it as a methodological challenge. They provide compelling empirical evidence and biologically grounded simulations to show that key generalization properties of entorhinal cortex representations can still be robustly detected. This not only validates their approach but also demonstrates how far non-invasive human neuroimaging can be pushed. The use of multiple independent datasets and carefully controlled permutation tests further underscores the reliability of their findings, making a strong case that structural generalization across diverse task environments can be meaningfully studied even in abstract, non-spatial domains that are otherwise difficult to investigate in animal models.

      Weaknesses:

      While this study provides compelling evidence for structural generalization in the entorhinal cortex (EC), several limitations remain that pave the way for promising future research. One issue is that the generalization effect was statistically robust in only one task condition, with weaker effects observed in the "community" condition. This raises the question of whether the null result genuinely reflects a lack of EC involvement, or whether it might be attributable to other factors such as task complexity, training order, or insufficient exposure, possibilities that the authors acknowledge as open questions. Moreover, although the study leverages fMRI to examine EC representations in humans, it does not clarify which specific components of EC coding (such as grid cells versus other spatially tuned but non-grid codes) underlie the observed generalization. While electrophysiological data in animals have begun to address this, the human experiments do not disentangle the contributions of these different coding types. This leaves unresolved the important question of what makes EC representations uniquely suited for generalization, particularly given that similar effects were not observed in other regions known to contain grid cells, such as the medial prefrontal cortex (mPFC) or posterior cingulate cortex (PCC). These limitations point to important future directions for better characterizing the computational role of the EC and its distinctiveness within the broader network supporting learning and decision making based on cognitive maps.

    5. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This study develops and validates a neural subspace similarity analysis for testing whether neural representations of graph structures generalize across graph size and stimulus sets. The authors show the method works in rat grid and place cell data, finding that grid but not place cells generalize across different environments, as expected. The authors then perform additional analyses and simulations to show that this method should also work on fMRI data. Finally, the authors test their method on fMRI responses from the entorhinal cortex (EC) in a task that involves graphs that vary in size (and stimulus set) and statistical structure (hexagonal and community). They find neural representations of stimulus sets in lateral occipital complex (LOC) generalize across statistical structure and that EC activity generalizes across stimulus sets/graph size, but only for the hexagonal structures.

      Strengths:

      (1) The overall topic is very interesting and timely and the manuscript is well-written.

      (2) The method is clever and powerful. It could be important for future research testing whether neural representations are aligned across problems with different state manifestations.

      (3) The findings provide new insights into generalizable neural representations of abstract task states in the entorhinal cortex.

      We thank the reviewer for their kind comments and clear summary of the paper and its strengths.

      Weaknesses:

      (1) The manuscript would benefit from improving the figures. Moreover, the clarity could be strengthened by including conceptual/schematic figures illustrating the logic and steps of the method early in the paper. This could be combined with an illustration of the remapping properties of grid and place cells and how the method captures these properties.

      We agree with the reviewer and have added a schematic figure of the method (Figure 1a).

      (2) Hexagonal and community structures appear to be confounded by training order. All subjects learned the hexagonal graph always before the community graph. As such, any differences between the two graphs could thus be explained (in theory) by order effects (although this is practically unlikely). However, given community and hexagonal structures shared the same stimuli, it is possible that subjects had to find ways to represent the community structures separately from the hexagonal structures. This could potentially explain why the authors did not find generalizations across graph sizes for community structures.

      We thank the reviewer for their comments. We agree that the null result regarding the community structures does not mean that EC doesn’t generalise over these structures, and that the training order could in theory contribute to the lack of an effect. The decision to keep the asymmetry of the training order was deliberate: we chose this order based on our previous study (Mark et al. 2020), where we show that learning a community structure first changes the learning strategy for subsequent graphs. We could perhaps have overcome this by increasing the training periods, but 1) the training period is already very long; 2) there would still be an asymmetry, because the group that first learns the community structure will struggle more in learning the hexagonal graph than vice versa, as shown in Mark et al. 2020.

      We have added the following sentences on this decision to the Methods section:

      “We chose to first teach hexagonal graphs for all participants and not randomize the order because of previous results showing that first learning community structure changes participants’ learning strategy (Mark et al. 2020).”

      (3) The authors include the results from a searchlight analysis to show the specificity of the effects of EC. A better way to show specificity would be to test for a double dissociation between the visual and structural contrast in two independently defined regions (e.g., anatomical ROIs of LOC and EC).

      Thanks for this suggestion. We indeed tried to run the analysis in a whole-ROI approach, but this did not result in a significant effect in EC. Importantly, we disagree with the reviewer that this is a “better way to show specificity” than the searchlight approach. In our view, the two analyses differ with respect to the spatial extent of the representation they test for. The searchlight approach is testing for a highly localised representation on the scale of small spheres with only 100 voxels. The signal of such a localised representation is likely to be drowned in the noise in an analysis that includes thousands of voxels which mostly don’t show the effect - as would be the case in the whole-ROI approach.

      (4) Subjects had more experience with the hexagonal and community structures before and during fMRI scanning. This is another confound, and possible reason why there was no generalization across stimulus sets for the community structure.

      See our response to comment (2).

      Reviewer #2 (Public review):

      Summary:

      Mark and colleagues test the hypothesis that entorhinal cortical representations may contain abstract structural information that facilitates generalization across structurally similar contexts. To do so, they use a method called "subspace generalization" designed to measure abstraction of representations across different settings. The authors validate the method using hippocampal place cells and entorhinal grid cells recorded in a spatial task, then perform simulations that support that it might be useful in aggregated responses such as those measured with fMRI. Then the method is applied to fMRI data that required participants to learn relationships between images in one of two structural motifs (hexagonal grids versus community structure). They show that the BOLD signal within an entorhinal ROI shows increased measures of subspace generalization across different tasks with the same hexagonal structure (as compared to tasks with different structures) but that there was no evidence for the complementary result (ie. increased generalization across tasks that share community structure, as compared to those with different structures). Taken together, this manuscript describes and validates a method for identifying fMRI representations that generalize across conditions and applies it to reveal entorhinal representations that emerge across specific shared structural conditions.

      Strengths:

      I found this paper interesting both in terms of its methods and its motivating questions. The question asked is novel and the methods employed are new - and I believe this is the first time that they have been applied to fMRI data. I also found the iterative validation of the methodology to be interesting and important - showing persuasively that the method could detect a target representation - even in the face of a random combination of tuning and with the addition of noise, both being major hurdles to investigating representations using fMRI.

      We thank the reviewer for their kind comments and the clear summary of our paper.

      Weaknesses:

      In part because of the thorough validation procedures, the paper came across to me as a bit of a hybrid between a methods paper and an empirical one. However, I have some concerns, both on the methods development/validation side, and on the empirical application side, which I believe limit what one can take away from the studies performed.

      We thank the reviewer for the comment. We agree that the paper comes across as a bit of a methods-empirical hybrid. We chose to do this because we believe (as the reviewer also points out) that there is value in both aspects of the paper.

      Regarding the methods side, while I can appreciate that the authors show how the subspace generalization method "could" identify representations of theoretical interest, I felt like there was a noticeable lack of characterization of the specificity of the method. Based on the main equation in the results section of the paper, it seems like the primary measure used here would be sensitive to overall firing rates/voxel activations, variance within specific neurons/voxels, and overall levels of correlation among neurons/voxels. While I believe that reasonable pre-processing strategies could deal with the first two potential issues, the third seems a bit more problematic - as obligate correlations among neurons/voxels surely exist in the brain and persist across context boundaries that are not achieving any sort of generalization (for example neurons that receive common input, or voxels that share spatial noise). The comparative approach (ie. computing difference in the measure across different comparison conditions) helps to mitigate this concern to some degree - but not completely - since if one of the conditions pushes activity into strongly spatially correlated dimensions, as would be expected if univariate activations were responsive to the conditions, then you'd expect generalization (driven by shared univariate activation of many voxels) to be specific to that set of conditions.

      We thank the reviewer for their comments. We would like to point out that we demean each voxel across all states/piles (three-picture sequences) in a given graph/task (what the reviewer is calling “a condition”). Hence there is no shared univariate activation of many voxels in response to a graph going into the computation, and no sensitivity to the overall firing rate/voxel activation. Our calculation captures the variance across state conditions within a task (here a graph), over and above the univariate effect of graph activity. In addition, we spatially pre-whiten the data within each searchlight, meaning that voxels with high noise variance are downweighted and noise correlations between voxels are removed prior to applying our method.
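
      The two pre-processing steps described in this response can be sketched as follows. This is a minimal illustration and not the authors' pipeline (they report using the RSA toolbox): the function names, the `residuals` input (GLM residuals used to estimate the noise covariance), and the `shrinkage` regulariser are all assumptions introduced here.

```python
import numpy as np

def demean_voxels(B):
    """Remove each voxel's mean across states (rows = voxels, columns = states)."""
    return B - B.mean(axis=1, keepdims=True)

def prewhiten(B, residuals, shrinkage=0.1):
    """Spatially whiten B with a noise covariance estimated from residuals.

    residuals: (n_timepoints, n_voxels) array; an assumed input, not from the paper.
    shrinkage: assumed regularisation toward the diagonal so the estimate stays invertible.
    """
    sigma = np.cov(residuals, rowvar=False)        # voxel x voxel noise covariance
    sigma = (1 - shrinkage) * sigma + shrinkage * np.diag(np.diag(sigma))
    evals, evecs = np.linalg.eigh(sigma)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T   # sigma^(-1/2)
    # High-variance voxels are downweighted and noise correlations are removed.
    return W @ B
```

      After demeaning, a uniform shift in a voxel's activation across all states of a graph contributes nothing to the covariance; after whitening, noise shared between neighbouring voxels no longer inflates it.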

      A second issue in terms of the method is that there is no comparison to simpler available methods. For example, given the aims of the paper, and the introduction of the method, I would have expected the authors to take the Neuron-by-Neuron correlation matrices for two conditions of interest, and examine how similar they are to one another, for example by correlating their lower triangle elements. Presumably, this method would pick up on most of the same things - although it would notably avoid interpreting high overall correlations as "generalization" - and perhaps paint a clearer picture of exactly what aspects of correlation structure are shared. Would this method pick up on the same things shown here? Is there a reason to use one method over the other?

      We thank the reviewer for this important and interesting point. We agree that calculating the correlation between the upper triangular elements of the covariance or correlation matrices picks up similar, but not identical, aspects of the data (see below the mathematical explanation that was added to the supplementary). When we repeated the searchlight analysis and calculated the correlation between the upper triangular entries of the Pearson correlation matrices, we obtained an effect in the EC, though weaker than with our subspace generalization method (t=3.9, the effect did not survive multiple comparisons). Similar results were obtained with the correlation between the upper triangular elements of the covariance matrices (t=3.8, the effect did not survive multiple comparisons).

      The difference between the two methods is twofold: 1) Our method is based on the covariance matrix and not the correlation matrix - i.e. a difference in normalisation. We realised that in the main text of the original paper we mistakenly wrote “correlation matrix” rather than “covariance matrix” (though our equations did correctly show the covariance matrix). We have corrected this mistake in the revised manuscript. 2) The weighting of the variance explained in the direction of each eigenvector is different between the methods, with some benefits of our method for identifying low-dimensional representations and for robustness to strong spatial correlations. We have added a section “Subspace Generalisation vs correlating the Neuron-by-Neuron correlation matrices” to the supplementary information with a mathematical explanation of these differences.
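
      For concreteness, the simpler comparison the reviewer proposes might look like the sketch below. The function name and the `use_correlation` switch are hypothetical; this is not taken from the authors' code.

```python
import numpy as np

def upper_triangle_similarity(B1, B2, use_correlation=True):
    """Correlate the off-diagonal upper-triangular entries of two unit-by-unit matrices.

    B1, B2: (n_units, n_states) activation matrices from two conditions.
    use_correlation: True compares Pearson correlation matrices,
                     False compares covariance matrices.
    """
    M1 = np.corrcoef(B1) if use_correlation else np.cov(B1)
    M2 = np.corrcoef(B2) if use_correlation else np.cov(B2)
    iu = np.triu_indices_from(M1, k=1)   # k=1 excludes the diagonal
    return np.corrcoef(M1[iu], M2[iu])[0, 1]
```

      Every pairwise entry enters this measure with equal weight, whereas the subspace approach weights directions by the variance they explain; this corresponds to the second difference the authors describe.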

      Regarding the fMRI empirical results, I have several concerns, some of which relate to concerns with the method itself described above. First, the spatial correlation patterns in fMRI data tend to be broad and will differ across conditions depending on variability in univariate responses (ie. if a condition contains some trials that evoke large univariate activations and others that evoke small univariate activations in the region). Are the eigenvectors that are shared across conditions capturing spatial patterns in voxel activations? Or, related to another concern with the method, are they capturing changing correlations across the entire set of voxels going into the analysis? As you might expect if the dynamic range of activations in the region is larger in one condition than the other?

      This is a searchlight analysis; it therefore captures the activity patterns within nearby voxels. Indeed, as we show in our simulation, areas with high activity, and therefore a high signal-to-noise ratio, will also have a better signal in our method. Note that this is true of most measures.

      My second concern is, beyond the specificity of the results, they provide only modest evidence for the key claims in the paper. The authors show a statistically significant result in the Entorhinal Cortex in one out of two conditions that they hypothesized they would see it. However, the effect is not particularly large. There is currently no examination of what the actual eigenvectors that transfer are doing/look like/are representing, nor how the degree of subspace generalization in EC may relate to individual differences in behavior, making it hard to assess the functional role of the relationship. So, at the end of the day, while the methods developed are interesting and potentially useful, I found the contributions to our understanding of EC representations to be somewhat limited.

      We agree with this point, yet believe that the results still shed light on EC functionality. Unfortunately, we could not find a correlation between behavioral measures and the fMRI effect.

      Reviewer #3 (Public review):

      Summary:

      The article explores the brain's ability to generalize information, with a specific focus on the entorhinal cortex (EC) and its role in learning and representing structural regularities that define relationships between entities in networks. The research provides empirical support for the longstanding theoretical and computational neuroscience hypothesis that the EC is crucial for structure generalization. It demonstrates that EC codes can generalize across non-spatial tasks that share common structural regularities, regardless of the similarity of sensory stimuli and network size.

      Strengths:

      (1) Empirical Support: The study provides strong empirical evidence for the theoretical and computational neuroscience argument about the EC's role in structure generalization.

      (2) Novel Approach: The research uses an innovative methodology and applies the same methods to three independent data sets, enhancing the robustness and reliability of the findings.

      (3) Controlled Analysis: The results are robust against well-controlled data and/or permutations.

      (4) Generalizability: By integrating data from different sources, the study offers a comprehensive understanding of the EC's role, strengthening the overall evidence supporting structural generalization across different task environments.

      Weaknesses:

      A potential criticism might arise from the fact that the authors applied innovative methods originally used in animal electrophysiology data (Samborska et al., 2022) to noisy fMRI signals. While this is a valid point, it is noteworthy that the authors provide robust simulations suggesting that the generalization properties in EC representations can be detected even in low-resolution, noisy data under biologically plausible assumptions. I believe this is actually an advantage of the study, as it demonstrates the extent to which we can explore how the brain generalizes structural knowledge across different task environments in humans using fMRI. This is crucial for addressing the brain's ability in non-spatial abstract tasks, which are difficult to test in animal models.

      While focusing on the role of the EC, this study does not extensively address whether other brain areas known to contain grid cells, such as the mPFC and PCC, also exhibit generalizable properties. Additionally, it remains unclear whether the EC encodes unique properties that differ from those of other systems. As the authors noted in the discussion, I believe this is an important question for future research.

      We thank the reviewer for their comments. We agree with the reviewer that this is a very interesting question. We looked for effects in the mPFC; the results were not strong enough to report in the main manuscript, but we do report a small effect in the supplementary.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) I wonder how important the PCA on B1 (voxel-by-state matrix from environment 1) and the computation of the AUC (from the projection on B2 [voxel-by-state matrix from environment 2]) is for the analysis to work. Would you not get the same result if you correlated the voxel-by-voxel correlation matrix based on B1 (C1) with the voxel-by-voxel correlation matrix based on B2 (C2)? I understand that you would not have the subspace-by-subspace resolution that comes from the individual eigenvectors, but would the AUC not strongly correlate with the correlation between C1 and C2?

      We agree with the reviewer's comments; see our response to Reviewer 2's second issue above.

      (2) There is a subtle difference between how the method is described for the neural recording and fMRI data. Line 695 states that principal components of the neuron x neuron intercorrelation matrix are computed, whereas line 888 implies that principal components of the data matrix B are computed. Of note, B is a voxel x pile rather than a pile x voxel matrix. Wouldn't this result in U being pile x pile rather than voxel x voxel?

      The PCs are calculated on the neuron x neuron (or voxel x voxel) covariance matrix of the activation matrix. We’ve added the following clarification to the relevant part of the Methods:

      “We calculated noise normalized GLM betas within each searchlight using the RSA toolbox. For each searchlight and each graph, we had a nVoxels (100) by nPiles (10) activation matrix (B) that describes the activation of a voxel as a result of a particular pile (three pictures’ sequence). We exploited the (voxel x voxel) covariance matrix of this matrix to quantify the manifold alignment within each searchlight.”
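
      The computation discussed in this exchange (principal components of the voxel x voxel covariance of B1, projection of B2 onto them, then an area under the cumulative variance curve) can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the function name is hypothetical, and the AUC is approximated here as the mean of the cumulative curve, which may differ from the normalisation used in the paper.

```python
import numpy as np

def subspace_generalization_auc(B1, B2):
    """Variance of B2 explained by the principal components of B1.

    B1, B2: (n_voxels, n_piles) activation matrices from two graphs.
    Returns the cumulative variance-explained curve and its area.
    """
    # Demean each voxel across piles so only across-state variance remains.
    B1c = B1 - B1.mean(axis=1, keepdims=True)
    B2c = B2 - B2.mean(axis=1, keepdims=True)

    # Voxel x voxel covariance of the first graph and its eigenvectors,
    # ordered from largest to smallest eigenvalue.
    C1 = B1c @ B1c.T / B1c.shape[1]
    evals, evecs = np.linalg.eigh(C1)
    U = evecs[:, np.argsort(evals)[::-1]]

    # Project the second graph onto those eigenvectors.
    var_per_pc = ((U.T @ B2c) ** 2).sum(axis=1)
    cumulative = np.cumsum(var_per_pc) / var_per_pc.sum()

    # Assumed AUC convention: mean height of the cumulative curve.
    return cumulative, cumulative.mean()
```

      If the two graphs share a low-dimensional subspace, the leading eigenvectors of B1 capture most of the variance in B2 and the AUC is high; for unrelated activity the curve rises roughly linearly and the AUC is near 0.5.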

      (3) It would be very helpful to the field if the authors would make the code and data publicly available. Please consider depositing the code for data analysis and simulations, as well as the preprocessed/extracted data for the key results (rat data/fMRI ROI data) into a publicly accessible repository.

      The code is publicly available in git (https://github.com/ShirleyMgit/subspace_generalization_paper_code/tree/main).

      (4) Line 219: "Kolmogorov Simonov test" should be "Kolmogorov Smirnov test".

      Thanks!

      (5) Please put plots in Figure 3F on the same y-axis.

      (6) Were large and small graphs of a given statistical structure learned on the same days, and if so, sequentially or simultaneously? This could be clarified.

      The graphs are learned on the same day. We clarified this in the Methods section.

      Reviewer #2 (Recommendations for the authors):

      Perhaps the advantage of the method described here is that you could narrow things down to the specific eigenvector that is doing the heavy lifting in terms of generalization... and then you could look at that eigenvector to see what aspect of the covariance structure persists across conditions of interest. For example, is it just the highest eigenvalue eigenvector that is likely picking up on correlations across the entire neural population? Or is there something more specific going on? One could start to get at this by looking at Figures 1A and 1C - for example, the primary difference for within/between condition generalization in 1C seems to emerge with the first component, and not much changes after that, perhaps suggesting that in this case, the analysis may be picking up on something like the overall level of correlations within different conditions, rather than a more specific pattern of correlations.

      The nature of the analysis means that the eigenvectors are ordered by their contribution to the variance; the first eigenvector is therefore responsible for more variance than the others. We did not rigorously check whether the remaining variance is split equally among the remaining eigenvectors, but this does not seem to be the case.

      Why is variance explained above zero for fraction EVs = 0 for figure 1C (but not 1A) ? Is there some plotting convention that I'm missing here?

      There was a small bug in this plot and it was corrected - thank you very much!

      The authors say:

      "Interestingly, the difference in AUCs was also significantly smaller than chance for place cells (Figure 1a, compare dotted and solid green lines, p<0.05 using permutation tests, see statistics and further examples in supplementary material Figure S2), consistent with recent models predicting hippocampal remapping that is not fully random (Whittington et al. 2020)."

      But my read of the Whittington model is that it would predict slight positive relationships here, rather than the observed negative ones, akin to what one would expect if hippocampal neurons reflect a nonlinear summation of a broad swath of entorhinal inputs.

      Smaller differences than chance imply that the remapping of place cells is not completely random.

      Figure 2:

      I didn't see any description of where noise amplitude values came from - or any justification at all in that section. Clearly, the amount of noise will be critical for putting limits on what can and cannot be detected with the method - I think this is worthy of characterization and explanation. In general, more information about the simulations is necessary to understand what was done in the pseudovoxel simulations. I get the gist of what was done, but these methods should be clear enough that someone could repeat them, and they currently are not.

      Thanks, we added noise amplitude to the figure legend and Methods.

      What does flexible mean in the title? The analysis only worked for the hexagonal grid - doesn't that suggest that whatever representations are uncovered here are not flexible in the sense of being able to encode different things?

      “Flexible” here means flexible over characteristics that are unrelated to the structural form, such as the stimuli, the size of the graph, etc.

      Reviewer #3 (Recommendations for the authors):

      I have noticed that the authors have updated the previous preprint version to include extensive simulations. I believe this addition helps address potential criticisms regarding the signal-to-noise ratio. If the authors could share the code for the fMRI data and the simulations in an open repository, it would enhance the study's impact by reaching a broader readership across various research fields. Except for that, I have nothing to ask for revision.

      Thanks, the code will be publicly available: (https://github.com/ShirleyMgit/subspace_generalization_paper_code/tree/main).

    1. eLife Assessment

      This important study advances our understanding of population-level immune responses to influenza in both children and adults. The strength of the evidence supporting the conclusions is compelling, with high-throughput profiling assays and mathematical modeling. The work will be of interest to immunologists, virologists, vaccine developers, and those working on mathematical modeling of infectious diseases.

    2. Reviewer #1 (Public review):

      The authors present exciting new experimental data on the antigenic recognition of 78 H3N2 strains (from the beginning of the 2023 Northern Hemisphere season) against a set of 150 serum samples. The authors compare protection profiles of individual sera and find that the antigenic effect of amino acid substitutions at specific sites depends on the immune class of the sera, differentiating between children and adults. Person-to-person heterogeneity in the measured titers is strong, specifically in the group of children's sera. The authors find that the fraction of sera with low titers correlates with the inferred growth rate using maximum likelihood regression (MLR), a correlation that does not hold for pooled sera. The authors then measure the protection profile of the sera against historical vaccine strains and find that it can be explained by birth cohort for children. Finally, the authors present data comparing pre- and post- vaccination protection profiles for 39 (USA) and 8 (Australia) adults. The data shows a cohort-specific vaccination effect as measured by the average titer increase, and also a virus-specific vaccination effect for the historical vaccine strains. The generated data is shared by the authors and they also note that these methods can be applied to inform the bi-annual vaccine composition meetings, which could be highly valuable.

      Thanks to the authors for the revised version of the manuscript. A few concerns remain after the revision:

      (1) We appreciate the additional computational analysis the authors have performed on normalizing the titers with the geometric mean titer for each individual, as shown in the new Supplemental Figure 6. We agree with the authors' statement that, after averaging again within specific age groups, "there are no obvious age group-specific patterns." A discussion of this should be added to the revised manuscript, for example in the section "Pooled sera fail to capture the heterogeneity of individual sera," referring to the new Supplemental Figure 6.

      However, we also suggested that after this normalization, patterns might emerge that are not necessarily defined by birth cohort. This possibility remains unexplored and could provide an interesting addition to support potential effects of substitutions at sites 145 and 275/276 in individuals with specific titer profiles, which as stated above do not necessarily follow birth cohort patterns.

      (2) Thank you for elaborating further on the method used to estimate growth rates in your reply to the reviewers. To clarify: the reason that we infer from Fig. 5a that A/Massachusetts has a higher fitness than A/Sydney is not because it reaches a higher maximum frequency, but because it seems to have a higher slope. The discrepancy between this plot and the MLR inferred fitness could be clarified by plotting the frequency trajectories on a log-scale.

      For the MLR, we understand that the initial frequency matters in assessing a variant's growth. However, when starting points of two clades differ in time (i.e., in different contexts of competing clades), this affects comparability, particularly between A/Massachusetts and A/Ontario, as well as for other strains. We still think that mentioning these time-dependent effects, which are not captured by the MLR analysis, would be appropriate. To support this, it could be helpful to include the MLR fits as an appendix figure, showing the different starting and/or time points used.

      (3) Regarding my previous suggestion to test an older vaccine strain than A/Texas/50/2012 to assess whether the observed peak in titer measurements is virus-specific: We understand that the authors want to focus the scope of this paper on the relative fitness of contemporary strains, and that this additional experimental effort would go beyond the main objectives outlined in this manuscript. However, the authors explicitly note that "Adults across age groups also have their highest titers to the oldest vaccine strain tested, consistent with the fact that these adults were first imprinted by exposure to an older strain." This statement gives the impression that imprinting effects increase titers for older strains, whereas this does not seem to be true from their results, but only true for A/Texas. It should be modified accordingly.

    3. Reviewer #2 (Public review):

      This is an excellent paper. The ability to measure the immune response to multiple viruses in parallel is a major advancement for the field, that will be relevant across pathogens (assuming the assay can be appropriately adapted). I only had a few comments, focused on maximising the information provided by the sera. These concerns were all addressed in the revised paper.

    4. Reviewer #3 (Public review):

      The authors use high throughput neutralisation data to explore how different summary statistics for population immune responses relate to strain success, as measured by growth rate during the 2023 season. The question of how serological measurements relate to epidemic growth is an important one, and I thought the authors present a thoughtful analysis tackling this question, with some clear figures. In particular, they found that stratifying the population based on the magnitude of their antibody titres correlates more with strain growth than using measurements derived from pooled serum data. The updated manuscript has a stronger motivation, and there is substantial potential to build on this work in future research.

      Comments on revisions:

      I have no additional recommendations. There are several areas where the work could be further developed, which were not addressed in detail in the responses, but given this is a strong manuscript as it stands, it is fine that these aspects are for consideration only at this point.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      The authors present exciting new experimental data on the antigenic recognition of 78 H3N2 strains (from the beginning of the 2023 Northern Hemisphere season) against a set of 150 serum samples. The authors compare protection profiles of individual sera and find that the antigenic effect of amino acid substitutions at specific sites depends on the immune class of the sera, differentiating between children and adults. Person-to-person heterogeneity in the measured titers is strong, specifically in the group of children's sera. The authors find that the fraction of sera with low titers correlates with the inferred growth rate using maximum likelihood regression (MLR), a correlation that does not hold for pooled sera. The authors then measure the protection profile of the sera against historical vaccine strains and find that it can be explained by birth cohort for children. Finally, the authors present data comparing pre- and post- vaccination protection profiles for 39 (USA) and 8 (Australia) adults. The data shows a cohort-specific vaccination effect as measured by the average titer increase, and also a virus-specific vaccination effect for the historical vaccine strains. The generated data is shared by the authors and they also note that these methods can be applied to inform the bi-annual vaccine composition meetings, which could be highly valuable.

      Thanks for this nice summary of our paper.

      The following points could be addressed in a revision:

      (1) The authors conclude that much of the person-to-person and strain-to-strain variation seems idiosyncratic to individual sera rather than age groups. This point is not yet fully convincing. While the mean titer of an individual may be idiosyncratic to the individual sera, the strain-to-strain variation still reveals some patterns that are consistent across individuals (the authors note the effects of substitutions at sites 145 and 275/276). A more detailed analysis, removing the individual-specific mean titer, could still show shared patterns in groups of individuals that are not necessarily defined by the birth cohort.

      As the reviewer suggests, we normalized the titers for all sera to the geometric mean titer for each individual in the US-based pre-vaccination adults and children. This is only for the 2023-circulating viral strains. We then faceted these normalized titers by the same age groups we used in Figure 6, and the resulting plot is shown. Although there are differences among virus strains (some are better neutralized than others), there are not obvious age group-specific patterns (eg, the trends in the two facets are similar). This observation suggests that at least for these relatively closely related recent H3N2 strains, the strain-to-strain variation does not obviously segregate by age group. Obviously, it is possible (we think likely) that there would be more obvious age-group specific trends if we looked at a larger swath of viral strains covering a longer time range (eg, over decades of influenza evolution). We have added the new plots shown as a Supplemental Figure 6 in the revised manuscript.
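
      The normalization described in this response can be sketched as follows. The function name and the array layout (one row per serum, one column per strain) are assumptions made for illustration.

```python
import numpy as np

def normalize_to_geometric_mean(titers):
    """Log2 fold-difference of each titer from that serum's geometric mean titer.

    titers: (n_sera, n_strains) array of positive neutralization titers.
    """
    log_titers = np.log2(np.asarray(titers, dtype=float))
    # Subtracting the mean of the logs is the same as dividing by the geometric mean.
    return log_titers - log_titers.mean(axis=1, keepdims=True)
```

      For example, a serum with titers 100 and 400 against two strains has a geometric mean of 200, so its normalized values are -1 and +1 (a twofold decrease and a twofold increase). This removes the serum-specific overall potency and leaves only the strain-to-strain pattern.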

      (2) The authors show that the fraction of sera with a titer below 138 correlates strongly with the inferred growth rate using MLR. However, the authors also note that there exists a strong correlation between the MLR growth rate and the number of HA1 mutations. This analysis does not yet show that the titers provide substantially more information about the evolutionary success. The actual relation between the measured titers and fitness is certainly more subtle than suggested by the correlation plot in Figure 5. For example, the clades A/Massachusetts and A/Sydney both have a positive fitness at the beginning of 2023, but A/Massachusetts has substantially higher relative fitness than A/Sydney. The growth inference in Figure 5b does not appear to map that difference, and the antigenic data would give the opposite ranking. Similarly, the clades A/Massachusetts and A/Ontario both have positive relative fitness, as correctly identified by the antigenic ranking, but at quite different times (i.e., in different contexts of competing clades). Other clades, like A/St. Petersburg, are assigned high growth and high escape but remain at low frequency throughout. Some mention of these effects not mapped by the analysis may be appropriate.

      Thanks for the nice summary of our findings in Figure 5. However, the reviewer is misreading the growth charts when they say that A/Massachusetts/18/2022 has a substantially higher fitness than A/Sydney/332/2023. Figure 5a (reprinted at left panel) shows the frequency trajectory of different variants over time. While A/Massachusetts/18/2022 reaches a higher frequency than A/Sydney/332/2023, the trajectory is similar and the reason that A/Massachusetts/18/2022 reached a higher max frequency is that it started at a higher frequency at the beginning of 2023. The MLR growth rate estimates differ from the maximum absolute frequency reached: instead, they reflect how rapidly each strain grows relative to others. In fact, A/Massachusetts/18/2022 and A/Sydney/332/2023 have similar growth rates, as shown in Supplemental Figure 6b (reprinted at right). Similarly, A/Saint-Petersburg/RII-166/2023 starts at a low initial frequency but then grows even as A/Massachusetts/18/2022 and A/Sydney/332/2023 are declining, and so has a higher growth rate than both of those. 

      In the revised manuscript, we have clarified how viral growth rates are estimated from frequency trajectories, and how growth rate differs from max frequency in the text below:

      “To estimate the evolutionary success of different human H3N2 influenza strains during 2023, we used multinomial logistic regression, which analyzes strain frequencies over time to calculate strain-specific relative growth rates [51–53]. There were sufficient sequencing counts to reliably estimate growth rates in 2023 for 12 of the HAs for which we measured titers using our sequencing-based neutralization assay libraries (Figure 5a,b and Supplemental Figure 9a,b). Note that these growth rates estimate how rapidly each strain grows relative to the other strains, rather than the absolute highest frequency reached by each strain.”

      (3) For the protection profile against the vaccine strains, the authors find for the adult cohort that the highest titer is always against the oldest vaccine strain tested, which is A/Texas/50/2012. However, the adult sera do not show an increase in titer towards older strains, but only a peak at A/Texas. Therefore, it could be that this is a virus-specific effect, rather than a property of the protection profile. Could the authors test with one older vaccine virus (A/Perth/16/2009?) whether this really can be a general property?

      We are interested in studying immune imprinting more thoroughly using sequencing-based neutralization assays, but we note that the adults in the cohorts we studied would have been imprinted with much older strains than included in this library. As this paper focuses on the relative fitness of contemporary strains with minor secondary points regarding imprinting, these experiments are beyond the scope of this study. We’re excited for future work (from our group or others) to explore these points by making a new virus library with strains from multiple decades of influenza evolution. 

      Reviewer #2 (Public review):

      This is an excellent paper. The ability to measure the immune response to multiple viruses in parallel is a major advancement for the field, which will be relevant across pathogens (assuming the assay can be appropriately adapted). I only have a few comments, focused on maximising the information provided by the sera.

      Thanks very much!

      Firstly, one of the major findings is that there is wide heterogeneity in responses across individuals. However, we could expect that individuals' responses should be at least correlated across the viruses considered, especially when individuals are of a similar age. It would be interesting to quantify the correlation in responses as a function of the difference in ages between pairs of individuals. I am also left wondering what the potential drivers of the differences in responses are, with age being presumably key. It would be interesting to explore individual factors associated with responses to specific viruses (beyond simply comparing adults versus children).

      We thank the reviewer for this interesting idea. We performed this analysis (and the related analyses described) and added this as a new Supplemental Figure 7, which is pasted after the response to the next related comment by the reviewer. 

      For 2023-circulating strains, we observed basically no correlation between the strength of correlation between pairs of sera and the difference in age between those pairs of sera (Supplemental Figure 7), which was unsurprising given the high degree of heterogeneity between individual sera (Figure 3, Supplemental Figure 6, and Supplemental Figure 8). For vaccine strains, there is a moderate negative correlation only in the children, but not in the adults or the combined group of adults and children. This could be because the children are younger with limited and potentially more similar vaccine and exposure histories than the adults. It could also be because the children are overall closer in age than the adults.

      Relatedly, is the phylogenetic distance between pairs of viruses associated with similarity in responses?

      For 2023-circulating strains, across sera cohorts we observed a weak-to-moderate correlation between the strength of correlation between the neutralizing titers across all sera to pairs of viruses and the Hamming distances between virus pairs. For the same comparison with vaccine strains, we observed moderate correlations, but this must be caveated with the slightly larger range of Hamming distances between vaccine strains. Notably, many of the points on the negative correlation slope are a mix of egg- and cell-produced vaccine strains from similar years, but there are some strain comparisons where the same year’s egg- and cell-produced vaccine strains correlate poorly.

      Figure 5C is also a really interesting result. To be able to predict growth rates based on titers in the sera is fascinating. As touched upon in the discussion, I suspect it is really dependent on the representativeness of the sera of the population (so, e.g., if only elderly individuals provided sera, it would be a different result than if only children provided samples). It may be interesting to compare different hypotheses - so e.g., see if a population-weighted titer is even better correlated with fitness - so the contribution from each individual's titer is linked to a number of individuals of that age in the population. Alternatively, maybe only the titers in younger individuals are most relevant to fitness, etc.

      We’re very interested in these analyses, but suggest they may be better explored in subsequent works that could sample more children, teenagers and adults across age groups. Our sera set, as the reviewer suggests, may be under-powered to perform the proposed analysis on subsetted age groups of our larger age cohorts. 

      In Figure 6, the authors lump together individuals within 10-year age categories - however, this is potentially throwing away the nuances of what is happening at individual ages, especially for the children, where the measured viruses cross different groups. I realise the numbers are small and the viruses only come from a small number of years, however, it may be preferable to order all the individuals by age (y-axis) and the viral responses in ascending order (x-axis) and plot the response as a heatmap. As currently plotted, it is difficult to compare across panels.

      This is a good suggestion. In the revised manuscript we have included a heatmap of the children and pre-vaccination adults, ordered by the year of birth of each individual, as Supplemental Figure 8. That new figure is also pasted in this response.

      Reviewer #3 (Public review):

      The authors use high-throughput neutralisation data to explore how different summary statistics for population immune responses relate to strain success, as measured by growth rate during the 2023 season. The question of how serological measurements relate to epidemic growth is an important one, and I thought the authors present a thoughtful analysis tackling this question, with some clear figures. In particular, they found that stratifying the population based on the magnitude of their antibody titres correlates more with strain growth than using measurements derived from pooled serum data. However, there are some areas where I thought the work could be more strongly motivated and linked together. In particular, how the vaccine responses in US and Australia in Figures 6-7 relate to the earlier analysis around growth rates, and what we would expect the relationship between growth rate and population immunity to be based on epidemic theory.

      Thank you for this nice summary. This reviewer also notes that the text related to Figures 6 and 7 is more secondary to the main story presented in Figures 3-5. The main motivation for including Figures 6 and 7 was to demonstrate the wide-ranging applications of sequencing-based neutralization data. We have tried to clarify this with the following minor text revisions, which do not add new content but which we hope smooth the transition between results sections.

      While the preceding analyses demonstrated the utility of sequencing-based neutralization assays for measuring titers of currently circulating strains, our library also included viruses with HAs from each of the H3N2 influenza Northern Hemisphere vaccine strains from the last decade (2014 to 2024, see Supplemental Table 1). These historical vaccine strains cover a much wider span of evolutionary diversity than the 2023-circulating strains analyzed in the preceding sections (Figure 2a,b and Supplemental Figure 2b-e). For this analysis, we focused on the cell-passaged strains for each vaccine, as these are more antigenically similar to their contemporary circulating strains than the egg-passaged vaccine strains since they lack the mutations that arise during growth of viruses in eggs [55–57] (Supplemental Table 1).

      Our sequencing-based assay could also be used to assess the impact of vaccination on neutralization titers against the full set of strains in our H3N2 library. To do this, we analyzed matched 28-day post-vaccination samples for each of the above-described 39 pre-vaccination samples from the cohort of adults based in the USA (Table 1). We also analyzed a smaller set of matched pre- and post-vaccination sera samples from a cohort of eight adults based in Australia (Table 1). Note that there are several differences between these cohorts: the USA-based cohort received the 2023-2024 Northern Hemisphere egg-grown vaccine whereas the Australia-based cohort received the 2024 Southern Hemisphere cell-grown vaccine, and most individuals in the USA-based cohort had also been vaccinated in the prior season whereas most individuals in the Australia-based cohort had not. Therefore, multiple factors could contribute to observed differences in vaccine response between the cohorts.

      Reviewer #3 (Recommendations for the authors):

      Main comments:

      (1) The authors compare titres of the pooled sera with the median titres across individual sera, finding a weak correlation (Figure 4). I was therefore interested in the finding that geometric mean titre and median across a study population are well correlated with growth rate (Supplemental Figure 6c). It would be useful to have some more discussion on why estimates from a pool are so much worse than pooled estimates.

      We thank this reviewer for this point. We would clarify that pooling sera is the equivalent of taking the arithmetic mean of the individual sera, rather than the geometric mean or median, which tends to bias the measurements of the pool to the outliers within the pool. To address this reviewer’s point, we’ve added the following text to the manuscript:

      “To confirm that sera pools are not reflective of the full heterogeneity of their constituent sera, we created equal volume pools of the children and adult sera and measured the titers of these pools using the sequencing-based neutralization assay. As expected, neutralization titers of the pooled sera were always higher than the median across the individual constituent sera, and the pool titers against different viral strains were only modestly correlated with the median titers across individual sera (Figure 4). The differences in titers across strains were also compressed in the serum pools relative to the median across individual sera (Figure 4). The failure of the serum pools to capture the median titers of all the individual sera is especially dramatic for the children sera (Figure 4) because these sera are so heterogeneous in their individual titers (Figure 3b). Taken together, these results show that serum pools do not fully represent individual-level heterogeneity, and are similar to taking the arithmetic mean of the titers for a pool of individuals, which tends to be biased by the highest titer sera”.
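
      The contrast between a pool and the median can be illustrated numerically. The sketch below follows the authors' approximation that an equal-volume pool behaves like the arithmetic mean of its constituent titers; the titer values are hypothetical and chosen only to include one high-titer outlier.

```python
import statistics

def summarize(titers):
    """Return three summary statistics for a list of individual serum titers."""
    return {
        # Approximation for an equal-volume pool (assumption from the response above).
        "arithmetic_mean": statistics.fmean(titers),
        "geometric_mean": statistics.geometric_mean(titers),
        "median": statistics.median(titers),
    }

# Hypothetical titers: four low-titer sera and one high-titer serum.
individual_titers = [40, 40, 80, 80, 5120]
summary = summarize(individual_titers)
```

      With these values the arithmetic mean is dominated by the single high-titer serum and lies far above both the median and the geometric mean, which is the bias the authors describe for pooled sera.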

      (2) Perhaps I missed it, but are growth rates weekly growth rates? (I assume so?)

      The growth rates are relative exponential growth rates calculated assuming a serial interval of 3.6 days. We also added clarifying language and a citation for the serial growth interval to the methods section:

      The analysis performing H3 HA strain growth rate estimates using the evofr[51] package is at https://github.com/jbloomlab/flu_H3_2023_seqneut_vs_growth. Briefly, we sought to make growth rate estimates for the strains in 2023 since this was the same timeframe when the sera were collected. To achieve this, we downloaded all publicly-available H3N2 sequences from the GISAID[88] EpiFlu database, filtering to only those sequences that closely matched a library HA1 sequence (within one HA1 amino-acid mutation) and were collected between January 2023 and December 2023. If a sequence was within one HA1 amino-acid mutation of multiple library HA1 proteins then it was assigned to the closest one; if there were multiple equally close matches then it was assigned fractionally to each match. We only made growth rate estimates for library strains with at least 80 sequencing counts (Supplemental Figure 9a), and ignored counts for sequences that did not match a library strain (equivalent results were obtained if we instead fit a growth rate for these sequences as an “other” category). We then fit multinomial logistic regression models using the evofr[51] package assuming a serial interval of 3.6 days[101]  to the strain counts. For the plot in Figure 5a the frequencies are averaged over a 14-day sliding window for visual clarity, but the fits were to the raw sequencing counts. For most of the analyses in this paper we used models based on requiring 80 sequencing counts to make an estimate for strain growth rates, and counting a sequence as a match if it was within one amino-acid mutation; see https://jbloomlab.github.io/flu_H3_2023_seqneut_vs_growth/ for comparable analyses using different reasonable sequence count cutoffs (e.g., 60, 50, 40 and 30, as depicted in Supplemental Figure 9).  
      Across sequence cutoffs, we found that the fraction of individuals with low neutralization titers and number of HA1 mutations correlated strongly with these MLR-estimated strain growth rates.
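
      The sequence-to-strain assignment rule described in the methods text can be sketched as follows. This is an illustration under stated assumptions, not the code in the linked repository: it assumes pre-aligned HA1 sequences of equal length, uses toy four-letter sequences, and all function and variable names are hypothetical.

```python
from collections import defaultdict

def hamming(a, b):
    """Number of mismatched positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

def assign_counts(sequences, library, max_mutations=1):
    """Assign each sequence to the closest library strain within `max_mutations`.

    Ties between equally close strains are split fractionally; sequences with
    no library strain within the cutoff are ignored.
    """
    counts = defaultdict(float)
    for seq in sequences:
        distances = {name: hamming(seq, ref) for name, ref in library.items()}
        best = min(distances.values())
        if best > max_mutations:
            continue  # no sufficiently close library strain
        closest = [name for name, d in distances.items() if d == best]
        for name in closest:
            counts[name] += 1.0 / len(closest)
    return dict(counts)

# Toy example: two library strains and four observed sequences.
library = {"lib1": "AAAA", "lib2": "AATT"}
observed = ["AAAA", "AAAT", "AATT", "TTTT"]
counts = assign_counts(observed, library)
```

      In the toy data, "AAAT" is one mutation from both library strains and is split evenly between them, while "TTTT" is more than one mutation from either and is dropped.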

      (3)  I found Figure 3 useful in that it presents phylogenetic structure alongside titres, to make it clearer why certain clusters of strains have a lower response. In contrast, I found it harder to meaningfully interpret Figure 7a beyond the conclusion that vaccines lead to a fairly uniform rise in titre. Do the 275 or 276 mutations that seem important for adults in Figure 3 have any impact?

      We are certainly interested in the questions this reviewer raises, and in trying to understand how well a seasonal vaccine protects against the most successful influenza variants that season. However, these post-vaccination sera were taken when neutralizing titers peak ~30 days after vaccination. Because of this, in the larger cohort of US-based post-vaccination adults, the median titers across sera to most strains appear uniformly high. In the Australian-based post-vaccination adults, there was some strain-to-strain variation in median titers across sera, but of course this must be caveated with the much smaller sample size. It might be more relevant to answer this question with longitudinally sampled sera, when titers begin to wane in the following months.

      (4)  It could be useful to define a mechanistic relationship about how you would expect susceptibility (e.g. fraction with titre < X, where X is a good correlate) to relate to growth via the reproduction number: R = R0 x S. For example, under the assumption the generation interval G is the same for all, we have R = exp(r*G), which would make it possible to make a prediction about how much we would expect the growth rate to change between S = 0.45 and 0.6, as in Fig 5c. This sort of brief calculation (or at least some discussion) could add some more theoretical underpinning to the analysis, and help others build on the work in settings with different fractions with low titres. It would also provide some intuition into whether we would expect relationships to be linear.

      This is an interesting idea for future work! However, the scope of our current study is to provide these experimental data and show a correlation with growth; we hope this can be used to build more mechanistic models in future.
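
      For orientation, the back-of-envelope calculation the reviewer outlines can be sketched as follows. It is not part of the authors' analysis. It takes R = R0 x S and R = exp(r*G) from the comment as given, treats the 3.6-day serial interval quoted in the response to comment (2) as the generation interval G (an assumption), and uses hypothetical function names.

```python
import math

def growth_rate(r0, susceptible_fraction, generation_interval_days):
    """Per-day growth rate r implied by R = R0 * S and R = exp(r * G)."""
    return math.log(r0 * susceptible_fraction) / generation_interval_days

def growth_rate_difference(s_low, s_high, generation_interval_days):
    """Difference in r between two susceptible fractions; R0 cancels out."""
    return math.log(s_high / s_low) / generation_interval_days

# Susceptible fractions from the reviewer's example; G = 3.6 days is an assumption.
delta_r = growth_rate_difference(0.45, 0.60, 3.6)
```

      Under these assumptions the difference between S = 0.45 and S = 0.6 is roughly 0.08 per day, independent of R0. Because r depends on log(S), the simple model predicts a logarithmic rather than strictly linear relationship between the susceptible fraction and growth rate.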

      (5) A key conclusion from the analysis is that the fraction above a threshold of ~140 is particularly informative for growth rate prediction, so would it be worth including this in Figure 6-7 to give a clearer indication of how much vaccination reduces contribution to strain growth among those who are vaccinated? This could also help link these figures more clearly with the main analysis and question.

      Although our data do find ~140 to be the threshold that gives max correlation with growth rate, we are not comfortable strongly concluding 140 is a correlate of protection, as titers could influence viral fitness without completely protecting against infection. In addition, inspection of Figure 5d shows that while ~140 does give the maximal correlation, a good correlation is observed for most cutoffs in the range from ~40 to 200, so we are not sure how robustly we can be sure that ~140 is the optimal threshold.
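
      The cutoff scan discussed here (correlating the fraction of sera below a titer cutoff with strain growth rate, for a range of cutoffs) can be sketched as below. All strain names, titers, and growth rates are hypothetical placeholders, and the Pearson correlation is an assumed choice of statistic rather than a statement of the authors' exact method.

```python
import math

def fraction_below(titers, cutoff):
    """Fraction of sera whose titer is below the cutoff."""
    return sum(t < cutoff for t in titers) / len(titers)

def pearson(x, y):
    """Pearson correlation; returns None if either input has zero variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)

def correlation_by_cutoff(titers_by_strain, growth_by_strain, cutoffs):
    """Correlate fraction-below-cutoff with growth rate for each cutoff."""
    strains = sorted(titers_by_strain)
    growth = [growth_by_strain[s] for s in strains]
    return {
        c: pearson([fraction_below(titers_by_strain[s], c) for s in strains], growth)
        for c in cutoffs
    }

# Hypothetical data: four sera per strain and made-up relative growth rates.
titers_by_strain = {
    "strain_A": [20, 60, 150, 300],
    "strain_B": [30, 100, 120, 500],
    "strain_C": [10, 50, 90, 130],
}
growth_by_strain = {"strain_A": 0.1, "strain_B": 0.2, "strain_C": 0.5}
scan = correlation_by_cutoff(titers_by_strain, growth_by_strain, [40, 140, 200])
```

      Plotting the resulting correlation against the cutoff gives a curve of the kind the authors refer to in Figure 5d; a broad plateau in such a curve would indicate that no single cutoff is clearly optimal.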

      (6)  In Figure 5, the caption doesn't seem to include a description for (e).

      Thank you to the reviewer for catching this – this is fixed now.

      (7)  The US vs Australia comparison could have benefited from more motivation. The authors conclude, "Due to the multiple differences between cohorts we are unable to confidently ascribe a cause to these differences in magnitude of vaccine response" - given the small sample sizes, what hypotheses could have been tested with these data? The comparison isn't covered in the Discussion, so it seems a bit tangential currently.

      Thank you to the reviewer for this comment, but we should clarify our aim was not to directly compare US and Australian adults. We are interested in regional comparisons between serum cohorts, but did not have the numbers to adequately address those questions here. This section (and the preceding question) were indeed both intended to be tangential to the main finding, and hopefully this will be clarified with our text additions in response to Reviewer #3’s public reviews.

    1. eLife Assessment

      This is a useful study that examines the relationship between neuropeptide signaling and the precision of vocal motor output using the songbird as a model system. The study presents evidence based on differential expression patterns and genetic or pharmacological inhibition of various neuropeptide genes for a causal role in song performance; however, this evidence is incomplete.

    2. Reviewer #1 (Public review):

      Summary:

      This study provides evidence that neuropeptide signaling, particularly via the CRH-CRHBP pathway, plays a key role in regulating the precision of vocal motor output in songbirds. By integrating gene expression profiling with targeted manipulations in the song vocal motor nucleus RA, the authors demonstrate that altering CRH and CRHBP levels bidirectionally modulate song variability. These findings reveal a previously unrecognized neuropeptidergic mechanism underlying motor performance control, supported by molecular and functional evidence.

      Strengths:

      Neural circuit mechanisms underlying motor variability have been intensively studied, yet the molecular bases of such variability remain poorly understood. The authors address this important gap using the songbird (Bengalese finch) as a model system for motor learning, providing experimental evidence that neuropeptide signaling contributes to vocal motor variability. They comprehensively characterize the expression patterns of neuropeptide-related genes in brain regions involved in song vocal learning and production, revealing distinct regulatory profiles compared to non-vocal regions, as well as developmental and behavioral dependencies, including altered expression following deafening and correlations with singing activity over the two days preceding sampling. Through these multi-level analyses spanning anatomy, development, and behavior, the authors identify the CRH-CRHBP pathway in the vocal motor nucleus RA as a candidate regulator of song variability. Functional manipulations further demonstrate that modulation of this pathway bidirectionally alters song variability.

      Overall, this work represents an effective use of songbirds: through a well-established neuroethological framework, it uncovers how previously uncharacterized molecular pathways shape behavioral output at the individual level.

      Weaknesses:

      (1) This study uses Bengalese finches (BFs) for all experiments (bulk RNA-seq, in situ hybridization across developmental stages, deafening, gene manipulation, and CRH microinfusion) except for the sc/snRNA-seq analysis. BFs differ from zebra finches (ZFs) in several important ways, including faster song degradation after deafening and greater syllable sequence complexity. This study makes effective use of these unique BF characteristics and should be commended for doing so.

      However, the major concern lies in the use of the single-cell/single-nucleus RNA-seq dataset from Colquitt et al. (2021), which combines data from both ZFs and BFs for cell-type classification. Based on our reanalysis of the publicly available dataset used in both Colquitt et al. (2021) and the present study, my lab identified two major issues:

      (a) The first concern is that the quality of the single-cell RNA-seq data from BFs is extremely poor, and the number of BF-derived cells is very limited. In other words, most of the gene expression information at the single-cell (or "subcellular type") level in this study likely reflects ZF rather than BF profiles. In our verification of the authors' publicly annotated data, we found that in the song nucleus RA, only about 18 glutamatergic cells (2.3%) of a total of 787 RA_Glut (RA_Glut1+2+3) cells were derived from BFs. Similarly, in HVC, only 53 cells (4.1%) out of 1,278 Glut1+Glut4 cells were BF-derived. This clearly indicates that the cell-subtype-level expression data discussed in this study are predominantly based on ZF, not BF, expression profiles.

      Recent studies have begun to report interspecies differences in the expression of many genes in the song control nuclei. It is therefore highly plausible that the expression patterns of CRHBP and other neuropeptide-signaling-related genes differ between ZFs and BFs. Yet, the current study does not appear to take this potential species difference into account. As a result, analyses such as the CellChat results (Fig. 2F and G) and the model proposed in Fig. 6G are based on ZF-derived transcriptomic information, even though the rest of the experimental data are derived from BF, which raises a critical methodological inconsistency.

      (b) The second major concern involves the definition of "subcellular types" in the sc/snRNA-seq dataset. Specifically, the RA_Glut1, 2, and 3 and HVC_Glu1 and 4 clusters, classified as glutamatergic projection neuron subtypes, may in fact represent inter-individual variation within the same cell type rather than true subtypes. Following Colquitt et al. (2021), Toji et al. (PNAS, 2024) demonstrated clear individual differences in the gene expression profiles of glutamatergic projection neurons in RA.

      In our reanalysis of the same dataset, we also observed multiple clusters representing the same glutamatergic projection neurons in UMAP space. This likely occurs because Seurat integration (anchor-based mutual nearest neighbor integration) was not applied, and because cells were not classified based on individual SNP information using tools such as Souporcell. When classified by individual SNPs, we confirmed that the RA_Glut1-3 and HVC_Glu1 and 4 clusters correspond simply to cells from different individuals rather than distinct subcellular types. (Although images cannot be attached in this review system, we can provide our analysis results if necessary.)

      This distinction is crucial, as subsequent analyses and interpretations throughout the manuscript depend on this classification. In particular, Figure 6G presents a model based on this questionable subcellular classification. Similarly, the ligand-receptor relationships shown in Figure 2G, such as the absence of SST-SSTR1 signaling in RA_Glut3 but its presence in RA_Glut1 and 2, are more plausibly explained by inter-individual variation than by subcellular-type specificity.

      Whether these differences are interpreted as individual variation within a single cell type or as differences in projection targets among glutamatergic neurons has major implications for understanding the biological meaning of neuropeptide-related gene expression in this system.

      (2) Based on the important finding that "CRHBP expression in the song motor pathway is correlated with singing," it is necessary to provide data showing that the observed changes in CRHBP and other neuropeptide-related gene expression during the song learning period or after deafening are not merely due to differences in singing amount over the two days preceding brain sampling.

      Without such data, the following statement cannot be justified: "Regarding CRHBP expression in the song motor pathway increases during song acquisition and decreases following deafening."

      (3) In Figure 5B, the authors should clearly distinguish between intact and deafened birds and show the singing amount for each group. In practice, deafening often leads to a reduction in both the number of song bouts and the total singing time. If, in this experiment, deafened birds also exhibited reduced singing compared to intact birds, then the decreased CRHBP expression observed in HVC and RA (Figures 3 and 4) may not reflect song deterioration, but rather a simple reduction in singing activity.

      As a similar viewpoint, the authors report that CRHBP expression levels in RA and HVC increase with age during the song learning period. However, this change may not be directly related to age or the decline in vocal plasticity. Instead, it could correlate with the singing amount during the one to two days preceding brain sampling. The authors should provide data on the singing activity of the birds used for in situ hybridization during the two days prior to sampling.

    3. Reviewer #2 (Public review):

      Summary:

      The results presented here are a useful extension of two of their previous papers (Colquitt et al 2021, Colquitt et al 2023), where they used single-cell transcriptomics to characterize the inhibitory and excitatory cell types and gene expression patterns of the song circuit, comparing them to mammalian and reptilian brains, and characterized the effect of deafening on these gene expression patterns. In this paper, they focus on the differential expression of various neuropeptidergic systems in the songbird brain. They discover a role for the CRHBP gene in song performance and causally show its influence on song variability.

      Strengths:

      The authors leverage the advantages of the 'nucleated' structure of the songbird neural circuitry and use a robust approach to compare neuropeptidergic gene expression patterns in these circuits. Their analysis of the expression patterns of the CRHBP gene in different cell types supports their conclusion that interneurons are particularly amenable to this modulation. Their use of a knockdown strategy along with pharmacological manipulation provides strong support for a causal role of neuropeptidergic modulation on song behaviour. These results have important implications as they bring into focus neuropeptide modulation of the song-motor circuit and pave the way for future studies focussing on how this signalling pathway regulates plasticity during song learning and maintenance.

      Weaknesses:

      While the results demonstrating the bidirectional modulation of CRH and CRHBP on song performance shed light on their role in song plasticity, it would be important to show this in juvenile finches during sensorimotor learning. We also don't get a clear picture of the 'causal' role of this signalling pathway on the song pre-motor area, HVC, as the knockdown and pharmacological manipulation studies were done in RA, whereas we see a modulation of CRHBP expression during deafening and song learning in both RA and HVC. Given the role of interneurons in the HVC in song acquisition (e.g., Vallentin et al. 2016, Science), it would have been interesting to see the results of HVC-specific manipulation of this neuropeptidergic pathway and/or how it affects the song learning process. Perhaps a short discussion of this would help to give the readers some perspective. Finally, a more direct demonstration of the neurophysiological effect of the signalling pathway would also strengthen our understanding of precisely how these modulate the song circuit plasticity, which I understand might be beyond the scope of this study.

      Technical/minor:

      In the Methods section, several clarifications would be beneficial. For instance, the description of the design matrices would benefit from being presented in a more general statistical form (e.g., linear model equations) rather than using R syntax. This would make the modeling approach more accessible to readers unfamiliar with software-specific syntax. In addition, while some variables (e.g., cdr_scale, frac_mito_scale) are briefly defined, others (e.g., tags, cut3, nsongs_last_two_days_cut3) could be more clearly described. This applies to the descriptions of both the gene set enrichment analysis and the neuropeptide-receptor analysis, which rely heavily on package-specific terminology (e.g., fgseaMultilevel, computeCommunProb), making it difficult for readers to understand the conceptual or statistical basis of the analyses. Providing a complete list of variable definitions, types (categorical or continuous), and any scaling or transformations applied would enhance clarity and reproducibility.

    4. Reviewer #3 (Public review):

      Summary:

      The stable production of learned vocalizations like human language and birdsong requires auditory feedback. What happens in the brain areas that generate stable vocalizations as performance deteriorates is not well understood. Using a species of songbird, the current study investigates individual cells within the evolutionarily conserved brain regions that generate learned vocalizations and suggests that the complement of neuropeptide (short protein) signals may be a key feature of behavioral change. Because neuropeptides are important across species, these findings may help explain diminishing stability in learned behaviors even in humans.

      Strengths:

      The experiments are solid and follow a strong progression from description through manipulation. The songbird model is appropriate and powerful to inform on generalizable biological mechanisms of precisely learned behaviors, including human speech.

      Weaknesses:

      While it is always possible to perform more experiments, most of the weaknesses are in the presentation of the project, not in the evidence or analysis, which are leading-edge and appropriate. Generally, the ability to follow the findings and to independently assess rigor would be enhanced with increased explicit mention of the statistical thresholds and subjective descriptions. In addition, two prior pieces of relevant work seem to be omitted, including one performing deafening, gene expression measures, and behavioral assessment in zebra finches, and another describing neuropeptide complements in zebra finch singing nuclei based largely on mass spectrometry. The former in particular should be related to the current findings.

    5. Author response:

      We thank the reviewers for their time and their constructive comments.

      Reviewer 1 makes several incisive comments about the single-cell RNA-sequencing dataset used in this version of the manuscript, which was previously published in Colquitt, 2021. The Reviewer correctly notes that this dataset consists primarily of nuclei from zebra finches, with a relatively small proportion of the data coming from Bengalese finches. However, all other data presented here comes from assays and experiments in Bengalese finches. This discrepancy could lead to two issues of interpretation. First, there could be substantive expression differences in the CRH signaling pathway between these two species, making it difficult to interpret its cellular expression profile. Second, the Reviewer describes that in their reanalysis of this dataset they determined that what had been described as distinct cell types – namely HVC-Glut-1 vs. HVC-Glut-4 (corresponding to the HVC_RA projection neurons) and the three RA-Glut types – are likely to be single cell types. The Reviewer notes that inter-individual differences in gene expression, which were not analyzed in the original publication, could have generated this apparent cell type diversity.

      To the first point, we agree that the use of the published dataset that consists primarily of zebra finch data is not ideal when making claims of cell type-specific expression in Bengalese finches. To rectify this issue, we have generated additional sets of snRNA-seq from Bengalese finches that encompass multiple areas of the song system as well as adjacent comparator regions outside of the principal song areas. Our initial analysis of these datasets indicates that the cellular patterns of expression of the CRH system are consistent with what has been presented here. In our revision, we will include a reanalysis of neuropeptide expression using these more extensive datasets.

      To the second point, we also agree that some of the instances of glutamatergic neuron diversity could have been generated either by issues stemming from the integration of two species or through interindividual differences. In our analysis of our newer snRNA-seq data, we also identify a single HVC_RA projection neuron type (not two) and find that RA projection neuron types fall into one or two classes (not three), similar to what Reviewer 1 described. We have deconvolved these datasets by genotype, as suggested by the Reviewer, and do not see substantial interindividual variation across the CRH system. However, our revision will explicitly address these issues.

      Reviewer 1 also brings up several important questions concerning the relationships between CRHBP and singing and the challenge of interpreting the influences of song acquisition and deafening on CRHBP expression, given the variation in singing that generally accompanies these changes to song. To address this issue in part, our regression analysis of deafening-associated gene expression differences includes a term for the number of songs sung on the day of euthanasia as well as an interaction term between song destabilization and singing amount. This design controls for the amount that a bird sang in the period before brain collection. This analysis was included in Colquitt et al. (2023) and will be further elaborated and discussed in the revised version of this manuscript. Notably, CRHBP expression shows a significant interaction between song destabilization and singing amount, suggesting that reduction of CRHBP following deafening is greater than what would be expected from any reductions in singing alone. This specific analysis will be included in the revised manuscript as well.
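
      As an illustrative sketch only, a regression with a singing term and a destabilization-by-singing interaction, as described above, could be written in general statistical form as follows. The symbols are assumptions introduced here for illustration and are not taken from the manuscript; the published model may include further covariates.

```latex
% Sketch under stated assumptions; notation is hypothetical, not the authors'.
% y_{gi} : normalized expression of gene g in sample i
% D_i    : song destabilization measure for the bird contributing sample i
% S_i    : number of songs sung on the day of euthanasia
\[
  y_{gi} \;=\; \beta_{0g} \;+\; \beta_{1g}\, D_i \;+\; \beta_{2g}\, S_i
  \;+\; \beta_{3g}\,(D_i \times S_i) \;+\; \varepsilon_{gi}
\]
```

      In this form, a non-zero interaction coefficient $\beta_{3g}$ would indicate that the association between destabilization and expression of gene $g$ depends on how much the bird sang, which is the kind of effect reported above for CRHBP.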

      However, despite these statistical controls, we cannot fully rule out that singing is playing a fundamental role in driving the CRHBP expression differences we see across conditions. Indeed, a number of studies have described an association between the amount a bird sings and the variability of its song (Chen et al., 2013; Hayase et al., 2018; Hilliard et al., 2012; Miller et al., 2010; Ohgushi et al., 2015), with a general trend of higher amounts of singing correlated with a reduction in variability. This relationship is consistent with what we see for CRHBP expression in RA and HVC: high in unmanipulated adult males and decreased during states of high variability and plasticity (post-deafening and juveniles). A model that combines these observations, and that we will include in the Discussion of the revised manuscript, is one in which singing induces the expression of CRHBP in RA and HVC, limiting CRH binding to its receptors, thereby limiting this pathway’s proposed effects on the excitability and synaptic plasticity of projection neurons.

      Reviewer 2 suggests multiple interesting avenues to more fully characterize the role of the CRH pathway in song performance and learning. First, we agree that HVC is a compelling target to investigate CRH’s role in song, given the similarity of CRHBP expression in HVC and RA across deafening, song acquisition, and singing. As the Reviewer notes, a number of studies have demonstrated key roles for interneurons in shaping neuronal dynamics in HVC and regulating song structure. Here, we focused on RA due to the direct influence RA projection neurons have on syringeal and respiration motoneurons controlling song production, and the resulting expectation that manipulations of CRH signaling in this region would have particularly measurable effects on song. However, we agree with the reviewer that it would be of additional interest to investigate manipulations of CRH signalling in HVC. We are considering whether it will be feasible, given the usual constraints of time, personnel, and other competing demands, to carry out such experiments as an addition to the current manuscript. Depending on how that goes, we will either add new experimental data to the manuscript, or simply acknowledge the interest of such experiments in the Discussion and defer their pursuit to future study.

      Likewise, Reviewer 2 suggests other ways in which an understanding of the role of CRH signalling could be further enriched with additional experiments, including investigating the influence of CRH signaling on song acquisition, when song transitions from a variable and plastic state to a precise and stereotyped state, and pursuing direct evidence that CRH influences the neurophysiology of glutamatergic neurons in HVC or RA. These are both excellent suggestions for ways in which neuropeptide signalling could be further linked to alterations in behavior. As we proceed with revisions, we will consider whether we can address some of these suggestions within the scope of the current manuscript or instead note them in the Discussion as directions for future research.

      Chen Q, Heston JB, Burkett ZD, White SA. 2013. Expression analysis of the speech-related genes FoxP1 and FoxP2 and their relation to singing behavior in two songbird species. J Exp Biol 216:3682–3692. doi:10.1242/jeb.085886

      Colquitt BM, Li K, Green F, Veline R, Brainard MS. 2023. Neural circuit-wide analysis of changes to gene expression during deafening-induced birdsong destabilization. Elife 12:e85970. doi:10.7554/eLife.85970

      Hayase S, Wang H, Ohgushi E, Kobayashi M, Mori C, Horita H, Mineta K, Liu W-C, Wada K. 2018. Vocal practice regulates singing activity-dependent genes underlying age-independent vocal learning in songbirds. PLoS Biol 16:e2006537. doi:10.1371/journal.pbio.2006537

      Hilliard AT, Miller JE, Fraley ER, Horvath S, White SA. 2012. Molecular microcircuitry underlies functional specification in a basal ganglia circuit dedicated to vocal learning. Neuron 73:537–552. doi:10.1016/j.neuron.2012.01.005

      Miller JE, Hilliard AT, White SA. 2010. Song practice promotes acute vocal variability at a key stage of sensorimotor learning. PLoS One 5:e8592. doi:10.1371/journal.pone.0008592

      Ohgushi E, Mori C, Wada K. 2015. Diurnal oscillation of vocal development associated with clustered singing by juvenile songbirds. J Exp Biol 218:2260–2268. doi:10.1242/jeb.115105

    1. eLife Assessment

      The authors aim to understand why Kupffer cells (KCs) die in metabolic-associated steatotic liver disease (MASLD). This is a useful study using in vitro studies and an in vivo genetic mouse model, suggesting that increased glycolysis contributes to KC death in MASLD. However, the data presented are incomplete as some inconsistencies in the results presented are identified in the characterisation of KCs. This work will be of interest to researchers in the immunology and metabolism fields.

    2. Reviewer #1 (Public review):

      Summary:

      The authors aim to investigate the mechanisms underlying Kupffer cell death in metabolic-associated steatotic liver disease (MASLD). The authors propose that KCs undergo massive cell death in MASLD and that glycolysis drives this process. However, there appears to be a discrepancy between the reported high rates of KC death and the apparent maintenance of KC homeostasis and replacement capacity.

      Strengths:

      This is an in vivo study.

      Weaknesses:

      There are discrepancies between the authors' observations and previous reports, as well as inconsistencies among their own findings.

      Before presenting the percentage of CLEC4F⁺TUNEL⁺ cells, the authors should have first shown the number of CLEC4F⁺ cells per unit area in Figure 1. At 16 weeks of age, the proportion of TUNEL⁺ KCs is extremely high (~60%), yet the flow cytometry data indicate that nearly all F4/80⁺ KCs are TIMD4⁺, suggesting an embryonic origin. If such extensive KC death occurred, the proportion of embryonically derived TIMD4⁺ KCs would be expected to decrease substantially. Surprisingly, the proportion of TIMD4⁺ KCs is comparable between chow-fed and 16-week HFHC-fed animals. Thus, the immunostaining and flow cytometry data are inconsistent, making it difficult to explain how massive KC death does not lead to their replacement by monocyte-derived cells.

      These data suggest that despite the reported high rate of cell death among CLEC4F⁺TIMD4⁺ KCs, the population appears to self-maintain, with no evidence of monocyte-derived KC generation in this model, which contradicts several recent studies in the field.

      Moreover, there is no evidence that TIMD4⁺CLEC4F⁺ KCs increase their proliferation rate to compensate for such extensive cell death. If approximately 60% of KCs are dying and no monocyte-derived KCs are recruited, one would expect a much greater decrease in total KC numbers than what is reported.

      It is also unexpected that the maximal rate of KC death occurs at early time points (8 weeks), when the mice have not yet gained substantial weight (Figure 1B). Previous studies have shown that longer feeding periods are typically required to observe the loss of embryo-derived KCs.

      Furthermore, it is surprising that the HFD induces as much KC death as the HFHC and MCD diets. Earlier studies suggested that HFD alone is far less effective than MASH-inducing diets at promoting the replacement of embryonic KCs by monocyte-derived macrophages.

      In Figure 2D, TIMD4 staining appears extremely faint, making the results difficult to interpret. In contrast, the TUNEL signal is strikingly intense and encompasses a large proportion of liver cells (approximately 60% of KCs, 15% of hepatocytes, 20% of hepatic stellate cells, 30% of non-KC macrophages, and a proportion of endothelial cells is also likely affected). This pattern closely resembles that typically observed in mouse models of acute liver failure. Given this apparent extent of cell death, it is unexpected that ALT and AST levels remain low in MASH mice, which is highly unusual.

      No statistical analysis is provided for Figure 5D, and it is unclear which metabolites show statistically significant changes in Figure 5C.

      In addition, there is no evaluation of liver pathology in Clec4f-Cre × Chil1flox/flox mice. It remains possible that the observed effects on KC death result from aggravated liver injury in these animals. There is also no evidence that Chil1 deficiency affects glucose metabolism in KCs in vivo.

      Finally, the authors should include a more direct experimental approach to modulate glycolysis in KCs and assess its causal role in KC death in MASH.

    3. Reviewer #2 (Public review):

      Summary:

      In this manuscript, He et al. set out to investigate the mechanisms behind Kupffer Cell death in MASLD. As has been previously shown, they demonstrate a loss of resident KCs in MASLD in different mouse models. They then go on to show that this correlates with alterations in genes/metabolites associated with glucose metabolism in KCs. To investigate the role of glucose metabolism further, they subject isolated KCs in vitro to different metabolic treatments and assess cleaved caspase 3 staining, demonstrating that KCs show increased Cl. Casp 3 staining upon stimulation of glycolysis. Finally, they use a genetic mouse model (Chil1KO) where they have previously reported that loss of this gene leads to increased glycolysis and validate this finding in BMDMs (KO). They then remove this gene specifically from KCs (Clec4fCre) and show that this leads to increased macrophage death compared with controls.

      Strengths:

      As we do not yet understand why KCs die in MASLD, this manuscript provides some explanation for this finding. The metabolomics is novel and provides insight into KC biology. It could also lead to further investigation; here, it will be important that the full dataset is made available.

      Weaknesses:

      Different diets are known to induce different amounts of KC loss, yet here, all models examined appear to result in 60% KC death. One small field of view of liver tissue is shown as representative to make these claims, but this is not sufficient, as anything can be claimed based on one field of view. Rather, a full tissue slice should be included to allow readers to really assess the level of death. Additionally, there is no consistency between the markers used to define KCs and moMFs, with CLEC4F being used in microscopy and TIM4 in flow, while the authors themselves acknowledge that moKCs are CLEC4F+TIM4-. As moKCs are induced in MASLD, this limits interpretation. Additionally, Iba1 is referred to as a moMF marker but is also expressed by KCs, which again prevents an accurate interpretation of the data. Indeed, the authors show that 60% of KCs are dying but only 30% of IBA1+ moMFs; as KCs are also IBA1+, this would mean that KCs die much more than moMFs, which would then limit the relevance of the BMDM studies performed if the phenotype is KC specific. Therefore, this needs to be clarified. The claim that periportal KCs die preferentially is not supported, given that the majority of KCs are periportal. Rather, these results would need to be normalised to KC numbers in PP vs PC regions to make meaningful conclusions. Additionally, KCs are known to be notoriously difficult to keep alive in vitro, and for these studies, the authors only examine cl. Casp 3 staining. To fully understand these data, a full analysis of the viability of the cells and whether they retain the KC phenotype in all conditions is required. Finally, in the Cre-driven KO model, there does not seem to be any death of KCs in the controls (rather, numbers trend towards an increase with time on diet, Figure 6E), contrary to what had been claimed in the rest of the paper, again making it difficult to interpret the overall results.
      Additionally, there is no validation that the increased death observed in vivo in KCs is due to further promotion of glycolysis.
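
      The normalisation point above can be illustrated with a minimal sketch. The counts, zone labels, and function names below are hypothetical and chosen only for illustration; they are not data or methods from the manuscript. The sketch contrasts the raw share of dying KCs found in each zone with the fraction of KCs dying within each zone.

```python
# Hypothetical counts for illustration only; not data from the manuscript.

def share_of_dying_by_zone(tunel_pos_kcs):
    """Fraction of all TUNEL+ KCs that are located in each zone (not normalised)."""
    total_dying = sum(tunel_pos_kcs.values())
    if total_dying <= 0:
        raise ValueError("no TUNEL+ KCs counted")
    return {zone: n / total_dying for zone, n in tunel_pos_kcs.items()}


def death_fraction_by_zone(total_kcs, tunel_pos_kcs):
    """Fraction of KCs within each zone that are TUNEL+ (normalised to zone KC number)."""
    fractions = {}
    for zone, total in total_kcs.items():
        if total <= 0:
            raise ValueError(f"zone {zone!r} has no KCs counted")
        fractions[zone] = tunel_pos_kcs.get(zone, 0) / total
    return fractions


# Assumed example: three times as many KCs sit in periportal (PP) as in pericentral (PC) regions.
total_kcs = {"PP": 300, "PC": 100}
tunel_pos_kcs = {"PP": 180, "PC": 60}

raw_share = share_of_dying_by_zone(tunel_pos_kcs)
normalised = death_fraction_by_zone(total_kcs, tunel_pos_kcs)

for zone in total_kcs:
    print(f"{zone}: share of dying KCs = {raw_share[zone]:.2f}, "
          f"fraction dying within zone = {normalised[zone]:.2f}")
```

      With these assumed counts, most dying KCs are periportal simply because most KCs are periportal, while the within-zone death fraction is identical in both zones.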

    4. Reviewer #3 (Public review):

      This manuscript provides novel insights into altered glucose metabolism and KC status during early MASLD. The authors propose that hyperactivated glycolysis drives a spatially patterned KC depletion that is more pronounced than the loss of hepatocytes or hepatic stellate cells. This concept significantly enhances our understanding of early MASLD progression and KC metabolic phenotype.

      Through a combination of TUNEL staining and MS-based metabolomic analyses of KCs from HFHC-fed mice, the authors show increased KC apoptosis alongside dysregulation of glycolysis and the pentose phosphate pathway. Using in vitro culture systems and KC-specific ablation of Chil1, a regulator of glycolytic flux, they further show that elevated glycolysis can promote KC apoptosis.

      However, it remains unclear whether the observed metabolic dysregulation directly causes KC death or whether secondary factors, such as low-grade inflammation or macrophage activation, also contribute significantly. Nonetheless, the results, particularly those derived from the Chil1-ablated model, point to a new potential target for the early prevention of KC death during MASLD progression.

      The manuscript is clearly written and thoughtfully addresses key limitations in the field, especially the focus on glycolytic intermediates rather than fatty acid oxidation. The authors acknowledge the missing mechanistic link between increased glycolysis and KC death. Still, several interpretations require moderation to avoid overstatement, and certain experimental details, particularly those concerning flow cytometry and population gating, need further clarification.

      Strengths:

      (1) The study presents the novel observation of profound metabolic dysregulation in KCs during early MASLD and identifies these cells as undergoing apoptosis. The finding that Chil1 ablation aggravates this phenotype opens new avenues for exploring therapeutic strategies to mitigate or reverse MASLD progression.

      (2) The authors provide a comprehensive metabolic profile of KCs following HFHC diet exposure, including quantification of individual metabolites. They further delineate alterations in glycolysis and the pentose phosphate pathway in Chil1-deficient cells, substantiating enhanced glycolytic flux through 13C-glucose tracing experiments.

      (3) The data underscore the critical importance of maintaining balanced glucose metabolism in both in vitro and in vivo contexts to prevent KC apoptosis, emphasizing the high metabolic specialization of these cells.

      (4) The observed increase in KC death in Chil1-deficient KCs demonstrates their dependence on tightly regulated glycolysis, particularly under pathological conditions such as early MASLD.

      Weaknesses:

      (1) The novelty is questionable. The presented work has considerable overlap with a study by the same lab, which is currently under review (citation 17), and it should be considered whether the data should not be presented in one paper.

      (2) The authors report that 60% of KCs are TUNEL-positive after 16 weeks of HFHC diet and confirm this by cleaved caspase-3 staining. Given that such marker positivity typically indicates imminent cell death within hours, it is unexpected that more extensive KC depletion or monocyte infiltration is not observed. Since Timd4 expression on monocyte-derived macrophages takes roughly one month to establish, the authors should consider whether these TUNEL-positive KCs persist in a pre-apoptotic state longer than anticipated. Alternatively, fate-mapping experiments could clarify the dynamics of KC death and replacement.

      (3) The mechanistic link between elevated glycolytic flux and KC death remains unclear.

      (4) The study does not address the polarization or ontogeny of KCs during early MASLD. Given that pro-inflammatory macrophages preferentially utilize glycolysis, such data could provide valuable insight into the reason for increased KC death beyond the presented hyperreliance on glycolysis.

      (5) The gating strategy for monocyte-derived macrophages (moMFs) appears suboptimal and may include monocytes. A more rigorous characterization of myeloid populations by including additional markers would strengthen the study's conclusions.

      (6) While BMDMs from Chil1 knockout mice are used to demonstrate enhanced glycolytic flux, it remains unclear whether Chil1 deficiency affects macrophage differentiation itself.

      (7) The authors use the PDK activator PS48 and the ATP synthase inhibitor oligomycin to argue that increased glycolytic flux at the expense of OXPHOS promotes KC death. However, given the high energy demands of KCs and the fact that OXPHOS yields 15-16 times more ATP per glucose molecule than glycolysis, the increased apoptosis observed in Figure 4C-F could primarily reflect energy deprivation rather than a glycolysis-specific mechanism.

      (8) In Figure 1C, KC numbers are significantly reduced after 4 and 16 weeks of HFHC diet in WT male mice, yet no comparable reduction is seen in Clec4f-Cre control mice, which should theoretically exhibit similar behavior under identical conditions.
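
      For reference, the 15-16-fold figure cited in point (7) can be reproduced from commonly quoted textbook estimates, used here as assumptions: a net yield of 2 ATP per glucose from glycolysis alone, and roughly 30-32 ATP per glucose from complete oxidation. Actual yields vary with cell type and conditions.

```latex
% Worked arithmetic under textbook assumptions (approximate yields per glucose).
\[
  \frac{\text{ATP from complete oxidation}}{\text{ATP from glycolysis alone}}
  \;\approx\; \frac{30}{2} = 15
  \quad\text{to}\quad
  \frac{32}{2} = 16
\]
```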

    1. eLife Assessment

      This study examines the role of the fungal pathogen Candida albicans in the progression of colorectal cancer, a relevant and urgent topic given the global incidence of colon cancer. While the findings are useful and provide solid experimental work and insight into how Candida may contribute to tumor progression, the small patient sample size, reliance on in vitro models, and absence of in vivo validation may limit its impact. This work will interest scientists studying cancer progression and the role played by pathogens.

    2. Reviewer #1 (Public review):

      Summary:

      This study addresses the emerging role of fungal pathogens in colorectal cancer and provides mechanistic insights into how Candida albicans may influence tumor-promoting pathways. While the work is potentially impactful and the experiments are carefully executed, the strength of evidence is limited by reliance on in vitro models, small patient sample size, and the absence of in vivo validation, which reduces the translational significance of the findings.

      Strengths:

      (1) Comprehensive mechanistic dissection of intracellular signaling pathways.

      (2) Broad use of pharmacological inhibitors and cell line models.

      (3) Inclusion of patient-derived organoids, which increases relevance to human disease.

      (4) Focus on an emerging and underexplored aspect of the tumor microenvironment, namely fungal pathogens.

      Weaknesses:

      (1) Clinical association data are inconsistent and based on very small sample numbers.

      (2) No in vivo validation, which limits the translational significance.

      (3) Species- and cell type-specificity claims are not well supported by the presented controls.

      (4) Reliance on colorectal cancer cell lines alone makes it difficult to judge whether findings are specific or general epithelial responses.

    3. Reviewer #2 (Public review):

      In this manuscript, the authors studied the role of Candida albicans in colorectal cancer progression. They have undertaken a thorough investigation using several methods. The topic is highly relevant, given the increasing burden of colon cancer globally and the urgent need for innovative treatment options.

      However, there are some inconsistencies in the figures and some missing details in the figures, including:

      (1) The authors should clearly explain in the results section which patient samples are shown in Figure 1B.

      (2) What do a, ab, b, b written above the bars in Figure 1F represent? Maybe authors should consider removing them, because they create confusion. Also, there is no explanation for those letters in the figure legend.

      (3) The authors should submit all the raw images of Western blot with appropriate labels to indicate the bands of protein of interest along with molecular weight markers.

      (4) The authors should do the quantification of data in Figure 2d and include it in the figure.

      (5) In Figure 2h, the authors should indicate if the quantification represents VEGF expression after 6h or 12h of C. albicans co-culture with cells.

      (6) In Figure 2i, quantification of VEGF should be done and data from three independent experiments should be submitted. The authors should also mention the time point.

    1. eLife Assessment

      M proteins are essential group A streptococci virulence factors that bind to numerous human proteins; a small subset of M proteins, such as M3, have been reported to bind collagen, which is thought to promote tissue adherence. In this paper, the authors provide a solid and important characterization of M3 interactions with collagen. The authors' work raises important questions regarding the specificity of the structure and its interactions with different collagens with implications for the variable actions of M protein collagen interactions on biofilm formation.

    2. Reviewer #1 (Public review):

      Summary:

      Wojnowska et al. report structural and functional studies of the interaction of Streptococcus pyogenes M3 protein with collagen. They show through X-ray crystallographic studies that the N-terminal hypervariable region of M3 protein forms a T-like structure, and that the T-like structure binds a three-stranded collagen-mimetic peptide. They indicate that the T-like structure is predicted by AlphaFold3 with moderate confidence level in other M proteins that have sequence similarity to M3 protein and M-like proteins from group C and G streptococci. For some, but not all, of these related M and M-like proteins, AlphaFold3 predicts, with moderate confidence level, complexes similar to the one observed for M3-collagen. Functionally, the authors show that emm3 strains form biofilms with more mass when surfaces are coated with collagen, and this effect can be blocked by an M3 protein fragment that contains the T-structure. They also show the co-occurrence of emm3 strains and collagen in patient biopsies and a skin tissue organoid. Puzzlingly, M1 protein has been reported to bind collagen, yet collagen inhibits biofilm formation in a particular emm1 strain, even though that same emm1 strain colocalizes with collagen in a patient biopsy sample. The implications of the variable actions of collagen on biofilm formation are not clear.

      Strengths:

      The paper is well written and the results are presented in a logical fashion.

      Weaknesses:

      A major limitation of the paper is that it is almost entirely observational and lacks detailed molecular investigation. Insufficient details or controls are provided to establish the robustness of the data.

      Comments on revisions:

      The authors' response to this reviewer's Major issue #1 is inadequate. Their argument is essentially that if they denature the protein, then there is no activity. This does not address the specificity of the structure or its interactions.

      They went only part way to addressing this reviewer's Major issue #2. While Figure 8 - supplement 3 shows 1D NMR spectra for M3 protein (what temperature?), it does not establish that stability is unaltered (to a significant degree).

      This reviewer's Major issue #3 is one of the major reasons for considering this study to be observational. This reviewer agrees that structural biology is by its nature observational, but modern standards require validation of structural observations. The authors' response is that a mechanistic investigation involving mutant bacterial strains and validation involving mutated proteins is beyond their scope. Therefore, the study remains observational.

      Major issue 4 was addressed suitably, but brings up the problematic point that the emm1 2006 strain colocalizes quite well with collagen in a patient biopsy sample but not in other assays. This calls into question the overall interpretability of the patient biopsy data.

      The authors have not provided a point-by-point response. Issues that were indicated to be minor previously were deemed to be minor because this reviewer thought that they could easily be addressed in a revision. It appears that the authors have ignored many of these comments, and these issues are therefore now considered to be major issues. For example, no errors are given for Kd measurements, Table 2 is sloppy and lacks the requested information, negative controls are missing (Figure 10 - figure supplement 1), and there is no indication of how many independent times each experiment was done.

      And "C4-binding protein" should be corrected to "C4b-binding protein."

    3. Reviewer #2 (Public review):

      Streptococcus pyogenes, or group A streptococci (GAS) can cause diseases ranging skin and mucosal infections, plasma invasion, and post-infection autoimmune syndromes. M proteins are essential GAS virulence factors that include an N-terminal hypervariable region (HVR). M proteins are known to bind to numerous human proteins; a small subset of M proteins were reported to bind collagen, which is thought to promote tissue adherence. In this paper, authors characterize M3 interactions with collagen and its role in biofilm formation. Specifically, they screened different collagen type II and III variants for full-length M3 protein binding using an ELISA-like method, detecting anti-GST antibody signal. By statistical analysis, hydrophobic amino acids and hydroxyproline found to positively support binding, whereas acidic residues and proline negatively impacted binding. The authors applied X-ray crystallography to determine the structure of the N-terminal domain (42-151 amino acids) of M3 protein (M3-NTD). M3-NTD dimmer (PDB 8P6K) forms a T-shaped structure with three helices (H1, H2, H3), which are stabilized by a hydrophobic core, inter-chain salt bridges and hydrogen bonds on H1, H2 helices, and H3 coiled coil. The conserved Gly113 serves as the turning point between H2 and H3. The M3-NTD is co-crystalized with a 24-residue peptide, JDM238, to determine the structure of M3-collagen binding. The structure (PDB 8P6J) shows that two copies of collagen in parallel bind to H1 and H2 of M3-NTD. Among the residues involved binding, conserved Try96 is shown to play a critical role supported by structure and isothermal titration calorimetry (ITC). The authors also apply a crystal-violet assay and fluorescence microscopy to determine that M3 is involved in collagen type I binding, but not M1 or M28. Tissue biopsy staining indicates that M3 strains co-localize with collagen IV-containing tissue, while M1 strains do not. 
      The authors provide generally compelling evidence to show that GAS M3 protein binds to collagen and plays a critical role in forming biofilms, which contribute to disease pathology. This is a very well-executed study and a well-written report relevant to understanding GAS pathogenesis and approaches to combatting disease; the data are also applicable to the emerging human pathogen Streptococcus dysgalactiae. One caveat that was not entirely resolved is if and how different collagen types might impact M3 binding and function. Due to technical constraints, the in vitro structure and other binding assays use type II collagen, whereas the in vivo biofilm formation assays and tissue biopsy staining use type I and IV collagen; it was unclear if this difference is significant. One possibility is that M3 binds all types of collagen without bias, and that only the distribution of collagens leads to the finding that M3 binds to type IV (basement membrane) and type I (various tissues including skin), rather than type II (cartilage).

      Comments on revisions:

      We are glad to see that the authors addressed our prior comments on M3 binding to different types of collagen in the Discussion section; adding a prediction of M3 binding to type I collagen (Figure 8-figure supplement 1B and 1C) helps to fill the gap. Although it would be nice to fill the gap experimentally by testing all collagen types in one experiment (for example, using different types of human collagen to test biofilm formation as in Figure 9A, or to compete for biofilm formation as in Figure 10), this appears to be beyond the scope of this paper. Meanwhile, the changes they have made are constructive.

      The authors have addressed the majority of our prior comments.

    4. Author response:

      The following is the authors’ response to the current reviews.

      We thank the reviewers for their comments on the initial submission, which helped us improve and extend the paper. We would like to respond specifically to reviewer #1.

      We disagree with the broad criticism of this study as being “almost entirely observational” and lacking “detailed molecular investigation”. We report structures and binding data, show mechanistic detail, identify critical residues and structural features underlying biological activity, and present biologically meaningful data demonstrating a role of the interaction of the M3 protein with collagens. We disagree that insufficient details or controls are included. We agree that our report has limitations, such as an understanding of potential emm1 strain binding to collagen, which might play a role in host tissue colonization, but not in biofilm.

      In response to issues raised in the initial review, we conducted several new experiments for the revised manuscript. We believe these strengthen what we report. Firstly, as the reviewer suggested, we conducted a binding experiment where the tertiary fold of M3-NTD was disrupted to confirm the T-shaped fold is indeed required for binding to collagen, as might be expected based on the crystal structure of the complex. To achieve this, we did not, as the reviewer states, use denatured protein in the ITC binding experiment. Instead, we used a monomeric form of M3-NTD, which does not adopt a well-defined tertiary structure, but retains all residues in the context of alpha helices. Secondly, we added more evidence for the importance of structural features (amino acid side chains defining the collagen binding site) by analysing the role of Trp103. Together, we provide clear evidence for the specific role of the T-shaped fold of M3-NTD for collagen binding.

      Responding to a constructive criticism by reviewer #1, we characterised M3-NTD mutants to demonstrate conservation of overall structure. NMR is an exquisite tool for this as it is highly sensitive to structural changes. It is not clear why the reviewer suggested we should have measured the stability of the proteins, which is irrelevant here. What matters is that the fold is conserved between mutated variants at the chosen experimental temperature (now added to the Methods section), which NMR demonstrates.

      We added errors for the ITC-derived dissociation constants.

      In the submitted versions of the paper we did not include the negative control requested by reviewer #1 for experiments shown in Figure 10 - figure supplement 1B. In our view this does not add information supporting our findings. However, we have now added two negative controls, staining of emm1 and emm28 strains. As expected, no reactivity was found with the type-specific M3 HVR antiserum while the M3 BCW antiserum showed weak reactivity, in line with some sequence similarity of the C-terminal regions of M proteins.

      Table 2 contains essential information, in line with what generally is shown in crystallographic tables in this journal. All other information can be found in the depositions of our data at the PDB. The structures have been scrutinised and checked by the PDB and passed all quality tests.

      We stated how many times experiments were done where appropriate. We now added this information for CLC assays (as given in the previously published protocol, refs. 45, 47). ITC was carried out more than once for optimization but the results of single experiments are shown (as is common practice).


      The following is the authors’ response to the original reviews.

      Many thanks for assessing our submission. We are grateful for the reviews that have informed a revised version of the paper, which includes additional data and modified text to take into account the reviewers’ comments. 

      We addressed the major limitation identified by Reviewer #1 by including data to demonstrate that collagen binding is indeed dependent on the T-shaped fold (major issue 1). Reviewer #1 suggested this needs to be done through extensive mutational work. This in our view was neither feasible nor necessary. Instead, we used ITC to measure collagen peptide binding using a monomeric form of M3, which preserves all residues including the ones involved in binding, but cannot form the T-shaped structure. This achieves the same as unravelling the T fold through mutations, but without the risk of affecting binding through altering residues that are involved in both binding and definition of the T fold. The experiment shows a very weak interaction, confirming the fold of the M3-NTD is required for binding activity.

      Reviewer #1 finds the study limited for being “almost entirely observational”. Structural biology is by its nature observational, which is not a limitation but the very purpose of this approach. Our study goes beyond observing structures. In the first version of our paper, we identified a critical residue within a previously mapped binding site, and demonstrated through mutagenesis a causal link between presence of this residue on a tertiary fold and collagen binding activity. However, we agree this analysis could have been strengthened by additional mutagenesis, which we carried out and describe in the revised manuscript. This identifies a second residue that is critical for collagen binding. We firmed up these mutational experiments with a characterisation of mutated forms of M3 by NMR spectroscopy to confirm that these mutations did not affect the overall fold, addressing major issue no. 2 of reviewer #1. We further demonstrate that the interaction between M3 and collagen is the cause of greatly enhanced biofilm formation as observed in patient biopsies and a tissue model of infection. We show that other streptococci that do not possess a surface protein presenting collagen binding sites like M3 do not form collagen-dependent biofilm. We therefore do not think that criticising our study for being almost entirely observational is valid.

      Major issue 3:

      We agree with the reviewer that it would be useful to carry out experiments with k.o. and complemented strains. Such experiments go beyond the scope of our study, but might be carried out by us or others in the future. We disagree that emm1 is used “as a negative”. Instead, we established that, in contrast to emm3 strains, emm1 strain biofilm formation is not enhanced by collagen. 

      We addressed major issue 4 by quantifying colocalizations in the patient biopsies and 3D tissue model experiments.

      We thank Reviewer #2 for the thorough analysis of our reported findings. The main criticism here (issue 1) concerns the question of whether binding of emm3 streptococci would differ between different types of collagen. Our collagen peptide binding assays together with the structural data identify the collagen triple helix as the binding site for M3. While collagen types differ in their distribution, functions and morphology in different tissues, they all have in common triple-helical (COL) regions with high sequence similarity that are non-specifically recognised by M3. Therefore, our data in conjunction with the body of published work showing binding of M3 to collagens I, II, III and IV suggest it is highly likely that emm3 streptococci will indeed bind to all types of collagen in the same manner. We added a statement to the manuscript to make this point more clearly. We also added a prediction of a complex between M3 and a collagen I triple-helical peptide, which supports the idea of a conserved binding mechanism for all collagen types. Whether this means all collagen types in the various tissues where they occur are targeted by emm3 streptococci is a very interesting question, however one that goes beyond the scope of our study.

      Minor issues identified by the reviewers were addressed through changes in the text and addition of figures.

      Summary of changes:

      (1) Two new authors have been added due to inclusion of additional data and analysis.

      (2) New experimental data included in section "M3-NTD harbors the collagen binding site".

      (3) Figure 3 panels A and B assigned and swapped.

      (4) Figure 4 changed to include new data and move mutant M3-NTD ITC graphs to supplement.

      (5) Table 2 corrected and amended.

      (6) AlphaFold3 quality parameters ipTM and pTM added to all figures showing predicted structures.

      (7) New supplementary figure added showing crystal packing of M3-NTD/collagen peptide complex.

      (8) Figure supplement of predicted M-protein/collagen peptide complexes includes new panel for a type I collagen peptide bound to M3.

      (9) New figure supplement showing mutant M3-NTD ITC data.

      (10) New figure supplement showing 1D ¹H NMR spectra of M3-NTD mutants.

      (11) Included data for additional M3-NTD mutants assessing role of Trp103 in collagen binding. Text extended to describe and place into context findings from ITC binding studies using these mutants.

      (12) Added quantitative analysis of biopsy and tissue model data (Manders' overlap coefficient).

      (13) Corrected and extended table 3 to take into account new primers.

      (14) Added experimental details for new NMR and ITC experiments as well as new quantitative image analysis.

      (15) Minor adjustments to the text to improve clarity and correct errors.
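
      The overlap coefficient named in item (12) can be illustrated with a minimal sketch. It assumes the standard definition, MOC = Σ(RᵢGᵢ) / √(ΣRᵢ² · ΣGᵢ²), applied to two same-sized intensity images; the function name is hypothetical, and background subtraction or thresholding, which the authors may have applied, is not shown.

```python
import numpy as np

def manders_overlap(ch1, ch2):
    """Manders' overlap coefficient for two same-sized intensity images.

    Assumed standard definition: sum(a*b) / sqrt(sum(a^2) * sum(b^2)).
    Returns 0.0 when either channel is entirely zero.
    """
    a = np.asarray(ch1, dtype=float).ravel()
    b = np.asarray(ch2, dtype=float).ravel()
    if a.shape != b.shape:
        raise ValueError("channels must have the same number of pixels")
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0
```

      Identical channels give a value of 1 and channels with no shared pixels give 0, so the coefficient reads as the fraction of signal overlap between, for example, a bacterial stain and a collagen stain.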

    1. eLife Assessment

      This is a valuable study describing transcriptome-based pheochromocytoma and paraganglioma (PPGL) subtypes and exploring the mutations, immune correlates and disease progression of cases in each subtype. The cohort is a reasonable size and a second cohort is included from the Cancer Genome Atlas (TCGA). One of the key premises of the study is that identification of driver mutations in PPGL is not complete and that compromises characterisation for prognostic purposes. This is a solid starting point on which to base characterisation using different methods.

    2. Reviewer #1 (Public review):

      This study presents an exploration of PPGL tumour bulk transcriptomics and identifies three clusters of samples (labeled as subtypes C1-C3). Each subtype is then investigated for the presence of somatic mutations, metabolism-associated pathway and inflammation correlates, and disease progression.

      The proposed subtype descriptions are presented as an exploratory study. The proposed potential biomarkers from this subtype are suitably caveated and will require further validation in PPGL cohorts together with mechanistic study.

      The first section uses WGCNA (a method to identify clusters of samples based on gene expression correlations) to discover three transcriptome-based clusters of PPGL tumours using a new cohort of n=87 PPGL samples from various locations in the body.

      The second section inspects a previously published snRNAseq dataset, assigning the published samples to subtypes C1-C3 using a pseudo-bulk approach.

      The tumour samples are obtained from multiple locations in the body, summarised in Fig1A. It will be important to see further investigation of how the sample origin is distributed among the C1-C3 clusters, and whether there is a sample-origin association with mutational drivers and disease progression.

      Comments on revisions:

      In SupplFile3 (pdf) - please correct the table format. The contents are obscured due to the narrowness of the table columns.

      Deposit the new RNAseq data (N=87 cases, N=5 controls) in an appropriate repository; see "Data on human genotypes and phenotypes" at https://elife-rp.msubmit.net/html/elife-rp_author_instructions.html#dataavailability

    3. Reviewer #2 (Public review):

      Summary:

      A study that furthers the molecular definition of PPGL (where prognosis is variable) and provides a wide range of sub-experiments to back up the findings. One of the key premises of the study is that identification of driver mutations in PPGL is incomplete and that compromises characterisation for prognostic purposes. This is a reasonable starting point on which to base some characterisation based on different methods.

      Strengths:

      The cohort is a reasonable size, and a useful validation cohort in the form of TCGA is used. Whilst it would be resource-intensive (though plausible given the rarity of the tumour type) to perform RNAseq on all PPGL samples in clinical practice, some potential proxies are proposed.

      Weaknesses:

      Performance of some of the proxy markers for transcriptional subtype is not presented.

      Limited prognostic information available.

      Comments on revisions:

      Having reviewed the responses to my comments and associated revisions, I am satisfied that they have been addressed.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      This study presents an exploration of PPGL tumour bulk transcriptomics and identifies three clusters of samples (labeled as subtypes C1-C3). Each subtype is then investigated for the presence of somatic mutations, metabolism-associated pathways and inflammation correlates, and disease progression. The proposed subtype descriptions are presented as an exploratory study. The proposed potential biomarkers from this subtype are suitably caveated and will require further validation in PPGL cohorts together with a mechanistic study.  

      The first section uses WGCNA (a method to identify clusters of samples based on gene expression correlations) to discover three transcriptome-based clusters of PPGL tumours. The second section inspects a previously published snRNAseq dataset, and labels some of the published cells as subtypes C1, C2, C3 (Methods could be clarified here), among other cells labelled as immune cell types. Further details about how the previously reported single-nuclei were assigned to the newly described subtypes C1-C3 require clarification.

      Thank you for your valuable suggestion. In response to the reviewer’s request for further clarification on “how previously published single-nuclei data were assigned to the newly defined C1-C3 subtypes,” we have provided additional methodological details in the revised manuscript (lines 103-109). Specifically, we aggregated the single-nucleus RNA-seq data to the sample level by summing gene counts across nuclei to generate pseudo-bulk expression profiles. These profiles were then normalized for library size, log-transformed (log1p), and z-scaled across samples. Using gene-set scores derived from our earlier WGCNA analysis of PPGLs, we defined transcriptional subtypes within the Magnus cohort (Supplementary Figure 1C). We further analyzed the single-nucleus data by classifying malignant (chromaffin) nuclei as C1, C2, or C3 based on their subtype scores, while non-malignant nuclei (including immune, stromal, endothelial, and others) were annotated using canonical cell-type markers (Figure 4A).
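
      The pseudo-bulk steps described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the counts-per-million scaling, the mean z-score used as a gene-set score, the rule of taking the top-scoring subtype, and all function names are hypothetical choices made for the example.

```python
import numpy as np

def pseudo_bulk(counts, sample_ids):
    """Sum nucleus-level counts into one profile per sample.

    counts: (n_nuclei, n_genes) array of raw counts.
    sample_ids: length n_nuclei, the sample each nucleus came from.
    """
    sample_ids = np.asarray(sample_ids)
    samples = np.unique(sample_ids)
    bulk = np.vstack([counts[sample_ids == s].sum(axis=0) for s in samples])
    return samples, bulk

def normalise(bulk, scale=1e6):
    """Library-size normalise, log1p, then z-scale each gene across samples."""
    lib = bulk.sum(axis=1, keepdims=True)
    logged = np.log1p(bulk / lib * scale)  # assumed counts-per-million scaling
    sd = logged.std(axis=0)
    sd[sd == 0] = 1.0  # constant genes stay at zero instead of dividing by 0
    return (logged - logged.mean(axis=0)) / sd

def assign_subtype(z, gene_sets):
    """Score samples by mean z-score per gene set; label by the top score."""
    labels = list(gene_sets)
    scores = np.column_stack([z[:, gene_sets[k]].mean(axis=1) for k in labels])
    return [labels[i] for i in scores.argmax(axis=1)], scores
```

      Here `gene_sets` maps each subtype label to column indices of its signature genes; in practice these would come from the WGCNA-derived gene sets mentioned in the response.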

      The tumour samples are obtained from multiple locations in the body (Figure 1A). It will be important to see further investigation of how the sample origin is distributed among the C1-C3 clusters, and whether there is a sample-origin association with mutational drivers and disease progression.

      Thank you for your valuable suggestion. In the revised manuscript (lines 74-79; Figure 1A, Table S1 and Supplementary Figure 1A), we harmonized anatomic site annotations from our PPGL cohort and the TCGA cohort and analyzed the distribution of tumor origin (adrenal vs extra-adrenal) across subtypes. The site composition is essentially uniform across C1-C3, at approximately 75% pheochromocytoma (PC) and 25% paraganglioma (PG), with only minimal variation. Notably, the proportion of extra-adrenal (paraganglioma) origin is slightly higher in the C1 subtype (see Supplementary Figure 1A), consistent with the typically more aggressive behavior of tumors from this anatomical site.

      Reviewer #2 (Public Review):

      A study that furthers the molecular definition of PPGL (where prognosis is variable) and provides a wide range of sub-experiments to back up the findings. One of the key premises of the study is that identification of driver mutations in PPGL is incomplete and that compromises characterisation for prognostic purposes. This is a reasonable starting point on which to base some characterisation based on different methods. The cohort is a reasonable size, and a useful validation cohort in the form of TCGA is used. Whilst it would be resource-intensive (though plausible given the rarity of the tumour type) to perform RNA-seq on all PPGL samples in clinical practice, some potential proxies are proposed.

      We sincerely thank the reviewer for their positive assessment of our study’s rationale. We fully agree that RNA sequencing for all PPGL samples remains resource-intensive in current clinical practice, and its widespread application still faces feasibility challenges. It is precisely for this reason that, after defining transcriptional subtypes, we further focused on identifying and validating practical molecular markers and exploring their detectability at the protein level.

      In this study, we validated key markers such as ANGPT2, PCSK1N, and GPX3 using immunohistochemistry (IHC), demonstrating their ability to effectively distinguish among molecular subtypes (see Figure 5). This provides a potential tool for the clinical translation of transcriptional subtyping, similar to the transcription factor-based subtyping in small cell lung cancer where IHC enables low-cost and rapid molecular classification.

      It should be noted that the subtyping performance of these markers has so far been preliminarily validated only in our internal cohort of 87 PPGL samples. We agree with the reviewer that larger-scale, multi-center prospective studies are needed in the future to further establish the reliability and prognostic value of these markers in clinical practice.

      The performance of some of the proxy markers for transcriptional subtype is not presented.

      We agree with your comment regarding the need to further evaluate the performance of proxy markers for transcriptional subtyping. In our study, we have in fact taken this point into full consideration. To translate the transcriptional subtypes into a clinically applicable classification tool, we employed a linear regression model to compare the effect values (β values) of candidate marker genes across subtypes (Supplementary Figure 1D-F). Genes with the most significant β values and statistical differences were selected as representative markers for each subtype.

      Ultimately, we identified ANGPT2, PCSK1N, and GPX3 (significantly overexpressed in subtypes C1, C2, and C3, respectively, and exhibiting the most pronounced β values) as robust marker genes for these subtypes (Figure 5A and Supplementary Figure 1D-F). These results support the utility of these markers in subtype classification and have been thoroughly validated in our analysis.

      There is limited prognostic information available.

      Thank you for your valuable suggestion. In this exploratory revision, we present the available prognostic signal in Figure 5C. Given the current event numbers and follow-up time, we intentionally limited inference. We are continuing longitudinal follow-up of the PPGL cohort and will periodically update and report mature time-to-event analyses in subsequent work.

      Reviewer #1 (Recommendations for the authors):

      There is no deposition reference for the RNAseq transcriptomics data. Have the data been deposited in a suitable data repository?

      Thank you for your valuable suggestion. We have updated the Data availability section (lines 508–511) to clarify that the bulk-tissue RNA-seq datasets generated in this study are available from the corresponding author upon reasonable request.

      In the snRNAseq analysis of existing published data, clarify how cells were labelled as "C1", "C2", "C3", alongside cells labelled by cell type (the latter is described briefly in the Methods).

      Thank you for your valuable suggestion. In response to the reviewer’s request for further clarification on “how previously published single-nuclei data were assigned to the newly defined C1-C3 subtypes,” we have provided additional methodological details in the revised manuscript (lines 103-109). Specifically, we aggregated the single-nucleus RNA-seq data to the sample level by summing gene counts across nuclei to generate pseudo-bulk expression profiles. These profiles were then normalized for library size, log-transformed (log1p), and z-scaled across samples. Using gene-set scores derived from our earlier WGCNA analysis of PPGLs, we defined transcriptional subtypes within the Magnus cohort (Supplementary Figure 1C). We further analyzed the single-nucleus data by classifying malignant (chromaffin) nuclei as C1, C2, or C3 based on their subtype scores, while non-malignant nuclei (including immune, stromal, endothelial, and others) were annotated using canonical cell-type markers (Figure 4A).

      Package versions should be included (e.g., CellChat, monocle2).

      We greatly appreciate your comments and have now added a dedicated “Software and versions” subsection in Methods. Specifically, we report Seurat (v4.4.0), sctransform (v0.4.2), CellChat (v2.2.0), monocle (v2.36.0; monocle2), pheatmap (v1.0.13), clusterProfiler (v4.16.0), survival (v3.8.3), and ggplot2 (v3.5.2) (lines 514-516). We also corrected a typographical error (“mafools” → “maftools”) (line 463).

      Reviewer #2 (Recommendations for the authors):

      It would be helpful to provide a little more detail on the clinical composition of the cohort (e.g., phaeo vs paraganglioma, age, etc.) in the text, acknowledging that this is done in Figure 1.

      Thank you for your valuable suggestion. In the revision, we added Table S1 that provides a detailed summary of the clinical composition of the PPGL cohort. Specifically, we report the numbers and proportions (Supplementary Figure 1A) of pheochromocytoma (PC) versus paraganglioma (PG), further subclassifying PG into head and neck (HN-PG), retroperitoneal (RPPG), and bladder (BC-PG).

      How many of each transcriptional subtype had driver mutations (germline or somatic)? This is included in the figures but would be worth mentioning in the text. Presumably, some of these may be present but not detected (e.g., non-coding variants), and this should be commented on. It is feasible that if methods to detect all the relevant genomic markers were improved, then the rate of tumours without driver mutations would be less and their prognostic utility would be more comprehensive.

      Thank you for your valuable suggestion. In the revision (lines 113–116), we now report the prevalence of driver mutations (germline or somatic) overall and by transcriptional subtype. We analyzed variant data across 84 PPGL-relevant genes from 179 tumors in the TCGA cohort and 30 tumors in Magnus’s cohort (Fig. 2A; Table S2). High-frequency genes were consistent with known biology—C1 enriched for [e.g., VHL/SDHB], C2 for [e.g., RET/HRAS], and C3 for [e.g., SDHA/SDHD]. We also note that a subset of tumors lacked an identifiable driver, which likely reflects current assay limitations (e.g., non-coding or structural variants, subclonality, and purity effects). Broader genomic profiling (deep WGS/long-read, RNA fusion, methylation) would be expected to reduce the “driver-negative” fraction and further enhance the prognostic utility of these classifiers.

      ANGPT2 provides a reasonable predictive capacity for the C1 subtype as defined by the ROC AUC. What was the performance of the PCSK1N and GPX3 as markers of the other subtypes?

      We agree with your comment regarding the need to further evaluate the performance of proxy markers for transcriptional subtyping, and we have supplemented the analysis with ROC and AUC values for two additional parameters (Author response image 1, see below). Furthermore, in our study, we have in fact taken this point into full consideration. To translate the transcriptional subtypes into a clinically applicable classification tool, we employed a linear regression model to compare the effect values (β values) of candidate marker genes across subtypes (Supplementary Figure 1D-F). Genes with the most significant β values and statistical differences were selected as representative markers for each subtype.

      Ultimately, we identified ANGPT2, PCSK1N, and GPX3 (significantly overexpressed in subtypes C1, C2, and C3, respectively, and exhibiting the most pronounced β values) as robust marker genes for these subtypes (Figure 5A and Supplementary Figure 1D-F). These results support the utility of these markers in subtype classification and have been thoroughly validated in our analysis.

      Author response image 1.

      Extended Data Figure A-B. (A) ROC curve showing the ability of PCSK1N expression in PPGLs to differentiate subtype C2 from non-C2 subtypes. The red dot indicates the point with the highest sensitivity (93.1%) and specificity (82.8%). AUC, area under the curve. (B) ROC curve showing the ability of GPX3 expression in PPGLs to differentiate subtype C3 from non-C3 subtypes. The red dot indicates the point with the highest sensitivity (83.0%) and specificity (58.8%). AUC, area under the curve.
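
      How a single marker yields an ROC curve, an AUC and an operating point like the red dots in the legend above can be sketched in a few lines. This is an illustration only: the response does not state how the highlighted point was chosen, so the use of Youden's J (sensitivity + specificity - 1) is an assumption, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (threshold, sensitivity, specificity) for each distinct score.

    A sample is called positive when its score is >= threshold.
    labels: 1 for the target subtype, 0 for all other subtypes.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < thr and y == 0)
        points.append((thr, tp / pos, tn / neg))
    return points

def auc(scores, labels):
    """AUC as the chance that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_cutoff(scores, labels):
    """Operating point maximising Youden's J (an assumed selection rule)."""
    return max(roc_points(scores, labels), key=lambda p: p[1] + p[2] - 1)
```

      Here `scores` would hold a marker's expression values (for example PCSK1N) and `labels` would flag membership of the target subtype (C2 versus non-C2).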

      In the discussion, I think it would be valuable to summarise existing clinical/molecular predictors in PPGL and, acknowledging that their performance may be limited, compare them to the potential of these novel classifiers.

      Thank you for your valuable suggestion. We have added a concise overview of established clinical and molecular predictors in PPGL and compared them with the potential of our transcriptional classifiers. The new paragraph (Discussion, lines 315–338) now reads:

      “Compared to existing clinical and molecular predictors, risk assessment in PPGL has long relied on the following indicators: clinicopathological features (e.g., tumor size, non-adrenal origin, specific secretory phenotype, Ki-67 index), histopathological scoring systems (such as PASS/GAPP), and certain genetic alterations (including high-risk markers like SDHB inactivation mutations, as well as susceptibility gene mutations in ATRX, TERT promoter, MAML3, VHL, NF1, among others). Although these metrics are highly actionable in clinical practice, they exhibit several limitations: first, current molecular markers only cover a subset of patients, and technical constraints hinder the detection of many potentially significant variants (e.g., non-coding mutations), thereby compromising the comprehensiveness of prognostic evaluation; second, histopathological scoring is susceptible to interobserver variability; furthermore, the lack of standardized detection and evaluation protocols across institutions limits the comparability and generalizability of results. Our transcriptomic classification system—comprising C1 (pseudohypoxic/angiogenic signature), C2 (kinase-signaling signature), and C3 (SDHx-related signature)—provides a complementary approach to PPGL risk assessment. These subtypes reflect distinct biological backgrounds tied to specific genetic alterations and can be approximated by measuring the expression of individual genes (e.g., ANGPT2, PCSK1N, or GPX3). This study demonstrates that the classifier offers three major advantages: first, it accurately distinguishes subtypes with coherent biological features; second, it retains significant predictive value even after adjusting for clinical covariates; third, it can be implemented using readily available assays such as immunohistochemistry. 
These findings suggest that integrating transcriptomic subtyping with conventional clinical markers may offer a more comprehensive and generalizable risk stratification framework. However, this strategy would require validation through multi-center prospective studies and standardization of detection protocols.”

      A little more explanation of the principles behind WGCNA would be useful in the methods.

      We are grateful for your comments. We have expanded the Methods to briefly explain the principles of WGCNA (lines 426-454). In short, WGCNA constructs a weighted co-expression network from normalized gene expression, identifies modules of tightly co-expressed genes, summarizes each module by its eigengene (the first principal component), and then correlates module eigengenes with phenotypes (e.g., transcriptional subtypes) to highlight biologically meaningful gene sets and candidate hub genes. We now specify our preprocessing, choice of soft-thresholding power to approximate scale-free topology, module detection/merging criteria, and the statistics used for module–trait association and downstream gene-set scoring.
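
      The core WGCNA quantities summarised above can be sketched in NumPy. This is a simplified illustration, not the WGCNA R package and not the authors' pipeline: it assumes an unsigned network, uses a soft-thresholding power of 6 purely as an example value, and omits module detection (hierarchical clustering of the topological overlap matrix); the function names are hypothetical.

```python
import numpy as np

def soft_threshold_adjacency(expr, power=6):
    """Unsigned weighted adjacency a_ij = |cor(gene_i, gene_j)| ** power.

    expr: (n_samples, n_genes) normalised expression matrix.
    power: soft-thresholding power (6 is an example value, not the authors').
    """
    cor = np.corrcoef(expr, rowvar=False)
    adj = np.abs(cor) ** power
    np.fill_diagonal(adj, 0.0)  # ignore self-connections
    return adj

def module_eigengene(expr, gene_idx):
    """First principal component of a module's standardised genes.

    Assumes no gene in the module is constant across samples.
    """
    sub = expr[:, gene_idx]
    sub = (sub - sub.mean(axis=0)) / sub.std(axis=0)
    u, s, vt = np.linalg.svd(sub, full_matrices=False)
    eig = u[:, 0]
    # Orient the eigengene to correlate positively with mean module expression.
    if np.corrcoef(eig, sub.mean(axis=1))[0, 1] < 0:
        eig = -eig
    return eig

def module_trait_correlation(eigengene, trait):
    """Pearson correlation between a module eigengene and a numeric trait."""
    return float(np.corrcoef(eigengene, trait)[0, 1])
```

      Raising correlations to a power suppresses weak gene-gene links while keeping strong ones, which is what lets the network approximate scale-free topology; the eigengene then gives one value per sample that can be correlated with a trait such as subtype membership.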

      On line 234, I think the figure should be 5C?

      We greatly appreciate your comment and have corrected this to Figure 5C.

    1. eLife Assessment

      This important series of studies provides converging results from complementary neuroimaging and behavioral experiments to identify human brain regions involved in representing regular geometric shapes and their core features. Geometric shape concepts are present across diverse human cultures and possibly involved in human capabilities such as numerical cognition and mathematical reasoning. Identifying the brain networks involved in geometric shape representation is of broad interest to researchers studying human visual perception, reasoning, and cognition. The evidence supporting the presence of representation of geometric shape regularity in dorsal parietal and prefrontal cortex is solid, but does not directly demonstrate that these circuits overlap with those involved in mathematical reasoning. Furthermore, the links to defining features of geometric objects and with mathematical and symbolic reasoning would benefit from stronger evidence from more fine-tuned experimental tasks varying the stimuli and experience.

    2. Reviewer #1 (Public review):

      This paper examines how geometric regularities in abstract shapes (e.g., parallelograms, kites) are perceived and processed in the human brain. The manuscript contains multimodal data (behavior, fMRI, MEG) from adults and additional fMRI data from 6-year-old children. The key findings show that (1) processing geometric shapes leads to reduced activity in ventral areas in comparison to complex stimuli and increased activity in intraparietal and inferior temporal regions, (2) the degree of geometric regularity modulates activity in intraparietal and inferior temporal regions, (3) similarity in neural representation of geometric shapes can be captured early by using CNN models and later by models of geometric regularity. In addition to these novel findings, the paper also includes a replication of behavioral data, showing that the perceptual similarity structure amongst the geometric stimuli used can be explained by a combination of visual similarities (as indexed by a feedforward CNN model of the ventral visual pathway) and geometric features. The paper comes with openly accessible code in a well-documented GitHub repository and the data will be published with the paper on OpenNeuro.

      In the revised version of this manuscript, the authors clarified certain aspects of the task design, added critical detail to the description of the methods, and updated the figures to show unsmoothed data and variability across participants. Importantly, the authors thoroughly discussed potential task effects (for the fMRI data only) and added additional analyses that indicate that the effects are unlikely to be driven by linguistic labels/name availability of the stimuli.

      Comments on the revision:

      Thank you for carefully addressing all my concerns and especially for clarifying the task design.

    3. Reviewer #2 (Public review):

      Summary

      The current study seeks to understand the neural mechanisms underlying geometric reasoning. Using fMRI with both children and adults, the authors found that contrasting simple geometric shapes with naturalistic images (faces, tools, houses) led to responses in the dorsal visual stream, rather than ventral regions that are generally thought to represent shape properties. The authors followed up on this result using computational modeling and MEG to show that geometric properties explain variance in the neural response that is distinct from what is captured by a CNN.

      Strengths

      These findings contribute much-needed neural and developmental data to the ongoing debate regarding shape processing in the brain and offer additional insights into why CNNs may have difficulty with shape processing. The motivation and discussion for the study are appropriately measured, and I appreciate the authors' use of multiple populations, neuroimaging modalities, and computational models to explore this question.

      Weaknesses

      The presence of activation in aIPS led the authors to interpret their results to mean that geometric reasoning draws on the same processes as mathematical thinking. However, there is only weak and indirect evidence in the current study that geometric reasoning, as it is tested here, draws on the same circuits as math.

    4. Reviewer #3 (Public review):

      Summary:

      The authors report converging evidence from behavioral studies as well as several brain-imaging techniques that geometric figures, notably quadrilaterals, are processed differently in visual (lower activation) and spatial (greater) areas of the human brain than representative figures. Comparison of mathematical models to fit activity for geometric figures shows the best fit for abstract geometric features like parallelism and symmetry. The brain areas active for geometric figures are also active in processing mathematical concepts even in blind mathematicians, linking geometric shapes to abstract math concepts. The effects are stronger in adults than in 6-year-old Western children. Similar phenomena do not appear in great apes, suggesting that this is uniquely human and developmental.

      Strengths:

      Multiple converging techniques of brain imaging and testing of mathematical models showing special status of perception of abstract forms. Careful reasoning at every step of research and presentation of research, anticipating and addressing possible reservations. Connecting these findings to other findings, brain, behavior, and historical/anthropological to suggest broad and important fundamental connections between abstract visual-spatial forms and mathematical reasoning.

      Weaknesses:

      I have reservations about the authors' use of "symbolic." They seem to interpret "symbolic" as relying on "discrete, exact, rule-based features." Words are generally considered to be symbolic (that is their major function), yet words do not meet those criteria. Depictions of objects can be regarded as symbolic because they represent real objects; they are not the same as the object (as Magritte observed). If so, then perhaps depictions of quadrilaterals are also symbolic, but then they do not differ from depictions of objects on that quality. Relatedly, calling abstract or generalized representations of forms a distinct "language of thought" doesn't seem supportable by the current findings. Minimally, a language has elements that are combined more or less according to rules. The authors present evidence for geometric forms as elements but nowhere is there evidence for combining them into meaningful strings.

      Further thoughts

      Incidentally, there have been many attempts at constructing visual languages from visual elements combined by rules, that is, mapping meaning to depictions. Many written languages like Egyptian hieroglyphics or Mayan or Chinese, began that way; there are current attempts using emoji. Apparently, mapping sound to discrete letters, alphabets, is more efficient and was invented once but spread. That said, for restricted domains like maps, circuit diagrams, networks, chemical interactions, mathematics, and more, visual "languages" work quite well.

      The findings are striking and as such invite speculation about their meaning and limitations. The images of real objects seem to be interpreted as representations of 3D objects as they activate the same visual areas as real objects. By contrast, the images of 2D geometric forms are not interpreted as representations of real objects but rather seemingly as 2D abstractions. It would be instructive to investigate stimuli that are on a continuum from representational to geometric, e. g., real objects that have simple geometric forms like table tops or boxes under various projections or balls or buildings that are rectangular or triangular. Objects differ from geometric forms in many ways: 3D rather than 2D, more complicated shapes; internal features as well as outlines. The geometric figures used are flat, 2-D, but much geometry is 3-D (e. g. cubes) with similar abstract features. The feature space of geometry is more than parallelism and symmetry; angles are important for example. Listing and testing features would be fascinating.

      Can we say that mathematical thinking began with the regularities of shapes or with counting, or both? External representations of counting go far back into prehistory; tallies are frequent and wide-spread. Infants are sensitive to number across domains as are other primates (and perhaps other species). Finding overlapping brain areas for geometric forms and number is intriguing but doesn't show how they are related.

      Categories are established in part by contrast categories; are quadrilaterals and triangles and circles different categories? As for quadrilaterals, the authors say some are "completely irregular." Not really; they are still quadrilaterals, if atypical. See Eleanor Rosch's insightful work on (visual) categories. One wonders about distinguishing squashed quadrilaterals from squashed triangles.

      What in human experience but not the experience of close primates would drive the abstraction of these geometric properties? It's easy to make a case for elaborate brain processes for recognizing and distinguishing things in the world, shared by many species, but the case for brain areas sensitive to abstracting geometric figures is harder. The fact that these areas are active in blind mathematicians and that they are parietal areas suggest that what is important is spatial far more than visual. Could these geometric figures and their abstract properties be connected in some way to behavior, perhaps with fabrication, construction or use of objects? Or with other interactions with complex objects and environments where symmetry and parallelism (and angles and curvature--and weight and size) would be important? Manual dexterity and fabrication also distinguish humans from great apes (quantitatively not qualitatively) and action drives both visual and spatial representations of objects and spaces in the brain. I certainly wouldn't expect the authors to add research to this already packed paper, but raising some of the conceptual issues would contribute to the significance of the paper.

    5. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      Weakness:

      I wonder how task difficulty and linguistic labels interact with the current findings. Based on the behavioral data, shapes with more geometric regularities are easier to detect when surrounded by other shapes. Do shape labels that are readily available (e.g., "square") help in making accurate and speedy decisions? Can the sensitivity to geometric regularity in intraparietal and inferior temporal regions be attributed to differences in task difficulty? Similarly, are the MEG oddball detection effects that are modulated by geometric regularity also affected by task difficulty?

      We see two aspects to the reviewer’s remarks.

      (1) Names for shapes.

      On the one hand is the question of whether it matters, in our task, that certain shapes have names and others do not. The work presented here is not designed to specifically test the effect of formal western education; however, in previous work (Sablé-Meyer et al., 2021), we noted that the geometric regularity effect remains present even for shapes that do not have specific names, and even in participants who do not have names for them. Thus, we replicated our main effects with both preschoolers and adults that did not attend formal western education and found that our geometric feature model remained predictive of their behavior; we refer the reader to this previous paper for an extensive discussion of the possible role of linguistic labels, and the impact of the statistics of the environment on task performance.

      What is more, in our behavior experiments we can discard data from any shape that has a name in English and run our model comparison again. Doing so diminished the effect size of the geometric feature model, but it remained predictive of human behavior: indeed, if we removed all shapes but kite, rightKite, rustedHinge, hinge and random (i.e., more than half of our data, and shapes for which we came up with names but there are no established names), we nevertheless find that both models significantly correlate with human behavior; see the plot in Author response image 1, the equivalent of our Fig. 1E with the remaining shapes.

      Author response image 1.

      An identical analysis on the MEG data leads to two noisy but significant clusters (CNN: 64.0ms to 172.0ms; then 192.0ms to 296.0ms; both p<.001; Geometric Features: 312.0ms to 364.0ms with p=.008). We have improved our manuscript thanks to the reviewer’s observation by adding a figure with the new behavior analysis to the supplementary figures and in the results section of the behavior task. We now refer to these analyses where appropriate:

      (intro) “The effect appeared as a human universal, present in preschoolers, first-graders, and adults without access to formal western math education (the Himba from Namibia), and thus seemingly independent of education and of the existence of linguistic labels for regular shapes.”

      (behavior results) “Finally, to separate the effect of name availability and geometric features on behavior, we replicated our analysis after removing the square, rectangle, trapezoids, rhombus and parallelogram from our data (Fig. S5D). This left us with five shapes, and an RDM with 10 entries. When regressing it in a GLM with our two models, we find that both models are still significant predictors (p<.001). The effect size of the geometric feature model is greatly reduced, yet remained significantly higher than that of the neural network model (p<.001).”

      (meg results) “This analysis yielded similar clusters when performed on a subset of shapes that do not have an obvious name in English, as was the case for the behavior analysis (CNN Encoding: 64.0ms to 172.0ms; then 192.0ms to 296.0ms; both p<.001: Geometric Features: 312.0ms to 364.0ms with p=.008).”

      (discussion, end of behavior section) “Previously, we only found such a significant mixture of predictors in uneducated humans (whether French preschoolers or adults from the Himba community, mitigating the possible impact of explicit western education, linguistic labels, and statistics of the environment on geometric shape representation) (Sablé-Meyer et al., 2021).”
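
      To make the RDM regression quoted in the behavior results concrete, here is a minimal sketch in plain Python, not the authors' analysis code. It assumes the five remaining shapes give a 5 x 5 dissimilarity matrix, vectorizes its upper triangle (5 x 4 / 2 = 10 entries), and computes standardized regression coefficients for two model RDMs. The function names are hypothetical, and the sketch computes no p-values.

```python
import math

def upper_triangle(matrix):
    """Vectorize the entries above the diagonal of a square dissimilarity matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def zscore(values):
    """Standardize a list to zero mean and unit (population) variance."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def correlation(x, y):
    """Pearson correlation of two equal-length lists."""
    return sum(a * b for a, b in zip(zscore(x), zscore(y))) / len(x)

def two_model_betas(data_rdm, model_a_rdm, model_b_rdm):
    """Standardized OLS coefficients of data ~ model_a + model_b.

    With z-scored variables and two predictors, the normal equations have
    the closed form beta_a = (r_ay - r_ab * r_by) / (1 - r_ab ** 2).
    """
    y, a, b = (upper_triangle(m) for m in (data_rdm, model_a_rdm, model_b_rdm))
    r_ab, r_ay, r_by = correlation(a, b), correlation(a, y), correlation(b, y)
    denom = 1 - r_ab ** 2          # zero only if the two models are collinear
    beta_a = (r_ay - r_ab * r_by) / denom
    beta_b = (r_by - r_ab * r_ay) / denom
    return beta_a, beta_b
```

      Because both predictors enter the same regression, each coefficient reflects the variance a model explains beyond the other, which is the sense in which two models can both remain significant predictors.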

      Perhaps the referee’s point can also be reversed: we provide a normative theory of geometric shape complexity which has the potential to explain why certain shapes have names: instead of seeing shape names as the cause of their simpler mental representation, we suggest that the converse could occur, i.e. the simpler shapes are the ones that are given names.

      (2) Task difficulty

      On the other hand is the question of whether our effect is driven by task difficulty. First, we would like to point out that this point could apply to the fMRI task, which asks for an explicit detection of deviants, but does not apply to the MEG experiment. In MEG, participants passively looked at sequences of shapes which, for a given block, comprised many instances of a fixed standard shape and rare deviants; even if they notice deviants, they have no task related to them. Yet two independent findings validated the geometric features model: there was a large effect of geometric regularity on the MEG response to deviants, and the MEG dissimilarity matrix between standard shapes correlated with a model based on geometric features, better than with a model based on CNNs. While the response to rare deviants might perhaps be attributed to “difficulty” (assuming that, in spite of the absence of an explicit task, participants try to spot the deviants and find this self-imposed task more difficult in runs with less regular shapes), it seems very hard to explain the representational similarity analysis (RSA) findings based on difficulty. Indeed, what motivated us to use RSA analysis in both fMRI and MEG was to stop relying on the response to deviants, and use solely the data from standard or “reference” shapes, and model their neural response with theory-derived regressors.

      We have updated the manuscript in several places to make our view on these points clearer:

      (experiment 4) “This design allowed us to study the neural mechanisms of the geometric regularity effect without confounding effects of task, task difficulty, or eye movements.”

      (figure 4, legend) “(A) Task structure: participants passively watch a constant stream of geometric shapes, one per second (presentation time 800ms). The stimuli are presented in blocks of 30 identical shapes up to scaling and rotation, with 4 occasional deviant shapes. Participants do not have a task to perform besides fixating.”

      Reviewer #2 (Public review):

      Weakness:

      Given that the primary takeaway from this study is that geometric shape information is found in the dorsal stream, rather than the ventral stream, there is very little discussion of prior work in this area (for reviews, see Freud et al., 2016; Orban, 2011; Xu, 2018). Indeed, there is extensive evidence of shape processing in the dorsal pathway in human adults (Freud, Culham, et al., 2017; Konen & Kastner, 2008; Romei et al., 2011), children (Freud et al., 2019), patients (Freud, Ganel, et al., 2017), and monkeys (Janssen et al., 2008; Sereno & Maunsell, 1998; Van Dromme et al., 2016), as well as the similarity between models and dorsal shape representations (Ayzenberg & Behrmann, 2022; Han & Sereno, 2022).

      We thank the reviewer for this opportunity to clarify our writing. We want to use this opportunity to highlight that our primary finding is not about whether the shapes of objects or animals (in general) are processed in the ventral versus the dorsal pathway, but rather about the much more restricted domain of geometric shapes such as squares and triangles. We propose that simple geometric shapes afford additional levels of mental representation that rely on their geometric features, on top of the typical visual processing. To the best of our knowledge, this point has not been made in the above papers.

      Still, we agree that it is useful to better link our proposal to previous ones. We have updated the discussion section titled “Two Visual Pathways” to include more specific references to the literature that have reported visual object representations in the dorsal pathway. Following another reviewer’s observation, we have also updated our analysis to better demonstrate the overlap in activation evoked by math and by geometry in the IPS, as well as include a novel comparison with independently published results.

      Overall, to address this point, we (i) show the overlap between our “geometry” contrast (shape > word+tools+houses) and our “math” contrast (number > words); (ii) display these ROIs side by side with ROIs found in previous work (Amalric and Dehaene, 2016); and (iii) test our “geometry” contrast (shape > word+tools+houses) in each of the math-related ROIs reported in that article, and find almost all of them to be significant in both populations; see Fig. S5.

      Finally, within the ROIs identified with our geometry localizer, we also performed similarity analyses: for each region we extracted the betas of every voxel for every visual category, and estimated the distance (cross-validated Mahalanobis) between different visual categories. In both ventral ROIs, in both populations, numbers were closer to shapes than to the other visual categories including text and Chinese characters (all p<.001). In adults, this result also holds for the right ITG (p=.021) and the left IPS (p=.014) but not the right IPS (p=.17). In children, this result did not hold in these areas.
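
      For readers unfamiliar with the cross-validated Mahalanobis ("crossnobis") distance, a minimal sketch follows. It is not the authors' implementation. It assumes one beta pattern per run and condition, and it simplifies the noise model to a per-voxel variance (a diagonal covariance), whereas the full estimator uses the inverse of the whole voxel-by-voxel noise covariance. The function and argument names are hypothetical.

```python
from itertools import permutations

def crossnobis(patterns_a, patterns_b, noise_var):
    """Cross-validated Mahalanobis distance between two conditions.

    patterns_a, patterns_b: lists of per-run voxel patterns (runs x voxels).
    noise_var: per-voxel noise variance (diagonal-covariance simplification).
    The pattern difference from one run is multiplied with the difference
    from another run, so independent noise averages out to zero.
    """
    n_runs, n_vox = len(patterns_a), len(noise_var)
    diffs = [[a - b for a, b in zip(run_a, run_b)]
             for run_a, run_b in zip(patterns_a, patterns_b)]
    pairs = list(permutations(range(n_runs), 2))   # all ordered pairs of distinct runs
    total = 0.0
    for i, j in pairs:
        total += sum(di * dj / v for di, dj, v in zip(diffs[i], diffs[j], noise_var))
    return total / (len(pairs) * n_vox)
```

      Because the two difference vectors come from independent runs, the estimate is centered on zero when two categories evoke the same pattern and can come out negative. This is why it suits asking whether one category is reliably closer to another.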

      Naturally, overlap in brain activation does not suffice to conclude that the same computational processes are involved. We have added an explicit caveat about this point. Indeed, throughout the article,  we have been careful to frame our results in a way that is appropriate given our evidence, e.g. saying “Those areas are similar to those active during number perception, arithmetic, geometric sequences, and the processing of high-level math concepts” and “The IPS areas activated by geometric shapes overlap with those active during the comprehension of elementary as well as advanced mathematical concepts”. We have rephrased the possibly ambiguous “geometric shapes activated math- and number-related areas, particular the right aIPS.” into “geometric shapes activated areas independently found to be activated by math- and number-related tasks, in particular the right aIPS”.

      Reviewer #3 (Public review):

      Weakness:

      Perhaps the manuscript could emphasize that the areas recruited by geometric figures but not objects are spatial, with reduced processing in visual areas. It also seems important to say that the images of real objects are interpreted as representations of 3D objects, as they activate the same visual areas as real objects. By contrast, the images of geometric forms are not interpreted as representations of real objects but rather perhaps as 2D abstractions.

      This is an interesting possibility. Geometric shapes are likely to draw attention to spatial dimensions (e.g. length) and to do so in a 2D spatial frame of reference rather than the 3D representations evoked by most other objects or images. However, this possibility would require further work to be thoroughly evaluated, for instance by comparing usual 3D objects with rare instances of 2D ones (e.g. a sheet of paper, a sticker etc). In the absence of such a test, we refrained from further speculation on this point.

      The authors use the term "symbolic." That use of that term could usefully be expanded here.  

      The reviewer is right in pointing out that “symbolic” should have been more clearly defined. We now added in the introduction:

      (introduction) “[…] we sometimes refer to this model as “symbolic” because it relies on discrete, exact, rule-based features rather than continuous representations (Sablé-Meyer et al., 2022). In this representational format, geometric shapes are postulated to be represented by symbolic expressions in a “language-of-thought”, e.g. “a square is a four-sided figure with four equal sides and four right angles” or equivalently by a computer-like program for drawing them in a Logo-like language (Sablé-Meyer et al., 2022).”

      Here, however, the present experiments do not directly probe this format of representation. We have therefore simplified our wording and removed many of our uses of the word “symbolic” in favor of the more specific “geometric features”.

      Pigeons have remarkable visual systems. According to my fallible memory, Herrnstein investigated visual categories in pigeons. They can recognize individual people from fragments of photos, among other feats. I believe pigeons failed at geometric figures and also at cartoon drawings of things they could recognize in photos. This suggests they did not interpret line drawings of objects as representations of objects.

      The comparison of geometric abilities across species is an interesting line of research. In the discussion, we briefly mention several lines of research that indicate that non-human primates do not perceive geometric shapes in the same way as we do – but for space reasons, we are reluctant to expand this section to a broader review of other more distant species. The referee is right that there is evidence of pigeons being able to perceive an invariant abstract 3D geometric shape in spite of much variation in viewpoint (Peissig et al., 2019) – but there does not seem to be evidence that they attend to geometric regularities specifically (e.g. squares versus non-squares). Also, the referee’s point bears on the somewhat different issue of whether humans and other animals may recognize the object depicted by a symbolic drawing (e.g. a sketch of a tree). Again, humans seem to be vastly superior in this domain, and research on this topic is currently ongoing in the lab. However, the point that we are making in the present work is specifically about the neural correlates of the representation of simple geometric shapes which by design were not intended to be interpretable as representations of objects.

      Categories are established in part by contrast categories; are quadrilaterals, triangles, and circles different categories?

      We are not sure how to interpret the referee’s question, since it bears on the definition of “category” (Spontaneous? After training? With what criterion?). While we are not aware of data that can unambiguously answer the reviewer’s question, categorical perception in geometric shapes can be inferred from early work investigating pop-out effects in visual search, e.g. (Treisman and Gormican, 1988): curvature appears to generate strong pop-out effects, and therefore we would expect e.g. circles to indeed be a different category than, say, triangles. Similarly, right angles, as well as parallel lines, have been found to be perceived categorically (Dillon et al., 2019).

      This suggests that indeed squares would be perceived as categorically different from triangles and circles. On the other hand, in our own previous work (Sablé-Meyer et al., 2021) we have found that the deviants that we generated from our quadrilaterals did not pop out from displays of reference quadrilaterals. Pop-out is probably not the proper criterion for defining what a “category” is, but this is the extent to which we can provide an answer to the reviewer’s question.

      It would be instructive to investigate stimuli that are on a continuum from representational to geometric, e.g., table tops or cartons under various projections, or balls or buildings that are rectangular or triangular. Building parts, inside and out. like corners. Objects differ from geometric forms in many ways: 3D rather than 2D, more complicated shapes, and internal texture. The geometric figures used are flat, 2-D, but much geometry is 3-D (e. g. cubes) with similar abstract features.

      We agree that there is a whole line of potential research here. We decided to start by focusing on the simplest set of geometric shapes that would give us enough variation in geometric regularity while being easy to match on other visual features. We agree with the reviewer that our results should hold both for more complex 2-D shapes and for 3-D shapes. Indeed, generative theories of shapes in higher dimensions following similar principles as ours have been devised (I. Biederman, 1987; Leyton, 2003). We now mention this in the discussion:

      “Finally, this research should ultimately be extended to the representation of 3-dimensional geometric shapes, for which similar symbolic generative models have indeed been proposed (Irving Biederman, 1987; Leyton, 2003).”

      The feature space of geometry is more than parallelism and symmetry; angles are important, for example. Listing and testing features would be fascinating. Similarly, looking at younger or preferably non-Western children, as Western children are exposed to shapes in play at early ages.

      We agree with the reviewer on all points. While we do not list and test the different properties separately in this work, we would like to highlight that angles are part of our geometric feature model, which includes features of “right-angle” and “equal-angles” as suggested by the reviewer.

      We also agree about the importance of testing populations with limited exposure to formal training with geometric shapes. This was in fact a core aspect of a previous article of ours which tests both preschoolers, and adults with no access to formal western education – though no non-Western children (Sablé-Meyer et al., 2021). It remains a challenge to perform brain-imaging studies in non-Western populations (although see Dehaene et al., 2010; Pegado et al., 2014).

      What in human experience but not the experience of close primates would drive the abstraction of these geometric properties? It's easy to make a case for elaborate brain processes for recognizing and distinguishing things in the world, shared by many species, but the case for brain areas sensitive to processing geometric figures is harder. The fact that these areas are active in blind mathematicians and that they are parietal areas suggests that what is important is spatial far more than visual. Could these geometric figures and their abstract properties be connected in some way to behavior, perhaps with fabrication and construction as well as use? Or with other interactions with complex objects and environments where symmetry and parallelism (and angles and curvature--and weight and size) would be important? Manual dexterity and fabrication also distinguish humans from great apes (quantitatively, not qualitatively), and action drives both visual and spatial representations of objects and spaces in the brain. I certainly wouldn't expect the authors to add research to this already packed paper, but raising some of the conceptual issues would contribute to the significance of the paper.

      We refrained from speculating about this point in the previous version of the article, but share some of the reviewer’s intuitions about the underlying drive for geometric abstraction. As described in (Dehaene, 2026; Sablé-Meyer et al., 2022), our hypothesis, which isn’t tested in the present article, is that the emergence of a pervasive ability to represent aspects of the world as compact expressions in a mental “language-of-thought” is what underlies many domains of specific human competence, including some listed by the reviewer (tool construction, scene understanding) and our domain of study here, geometric shapes.

      Recommendations for the Authors:

      Reviewer #1 (Recommendations for the authors):

      Overall, I enjoyed reading this paper. It is clearly written and nicely showcases the amount of work that has gone into conducting all these experiments and analyzing the data in sophisticated ways. I also thought the figures were great, and I liked the level of organization in the GitHub repository and am looking forward to seeing the shared data on OpenNeuro. I have some specific questions I hope the authors can address.

      (1) Behavior

      - Looking at Figure 1, it seemed like most shapes are clustering together, whereas square, rectangle, and maybe rhombus and parallelogram are slightly more unique. I was wondering whether the authors could comment on the potential influence of linguistic labels. Is it possible that it is easier to discard the intruder when the shapes are readily nameable versus not?

      This is an interesting observation, but the existence of names for shapes does not suffice to explain all of our findings; see our reply to the public comment.

      (2) fMRI

      - As mentioned in the public review, I was surprised that the authors went with an intruder task because I would imagine that performance depends on the specific combination of geometric shapes used within a trial. I assume it is much harder to find, for example, a "Right Hinge" embedded within "Hinge" stimuli than a "Right Hinge" amongst "Squares". In addition, the rotation and scaling of each individual item should affect regular shapes less than irregular shapes, creating visual dissimilarities that would presumably make the task harder. Can the authors comment on how we can be sure that the differences we pick up in the parietal areas are not related to task difficulty but are truly related to geometric shape regularities?

      Again, please see our public review response for a larger discussion of the impact of task difficulty. There are two aspects to answering this question.

      First, the task is not as the reviewer describes: the intruder task is to find a deviant shape within several slightly rotated and scaled versions of the regular shape it came from. During brain imaging, we did not ask participants to find an exemplar of one of our reference shapes amidst copies of another, but rather a deviant version of one shape against copies of its reference version. We only used this intruder task with all pairs of shapes to generate the behavioral RSA matrix.

      Second, we agree that some of the fMRI effect may stem from task difficulty, and this motivated our use of RSA analysis in fMRI, and a passive MEG task. RSA results cannot be explained by task difficulty.

      Overall, we have tried to make the limitations of the fMRI design, and the motivation for turning to passive presentation in MEG, clearer by stating the issues more clearly when we introduce experiment 4:

      “The temporal resolution of fMRI does not allow tracking the dynamics of mental representations over time. Furthermore, the previous fMRI experiment suffered from several limitations. First, we studied six quadrilaterals only, compared to 11 in our previous behavioral work. Second, we used an explicit intruder detection task, which implies that the geometric regularity effect was correlated with task difficulty, and we cannot exclude that this factor alone explains some of the activations in figure 3C (although it is much less clear how task difficulty alone would explain the RSA results in figure 3D). Third, the long display duration, which was necessary for good task performance especially in children, afforded the possibility of eye movements, which were not monitored inside the 3T scanner and again could have affected the activations in figure 3C.”

      - How far in the periphery were the stimuli presented? Was eye-tracking data collected for the intruder task? Similar to the point above, I would imagine that a harder trial would result in more eye movements to find the intruder, which could drive some of the differences observed here.

      A 1-degree bar was added to Figure 3A, which faithfully illustrates how the stimuli were presented in fMRI. Eye-tracking data was not collected during fMRI. Although the participants were explicitly instructed to fixate at the center of the screen and avoid eye movements, we fully agree with the referee that we cannot exclude that eye movements were present, perhaps more so for more difficult displays, and would therefore have contributed to the observed fMRI activations in experiment 3 (figure 3C). We now mention this limitation explicitly at the end of experiment 3. However, crucially, this potential problem cannot apply to the MEG data. During the MEG task, the stimuli were presented one by one at the center of the screen, without any explicit task, thus avoiding issues of eye movements. We therefore consider the MEG geometrical regularity effect, which comes at a relatively early latency (starting at ~160 ms) and even in a passive task, to provide the strongest evidence of geometric coding, unaffected by potential eye movement artefacts.

      - I was wondering whether the authors would consider showing some un-thresholded maps just to see how widespread the activation of the geometric shapes is across all of the cortex.

      We share the uncorrected threshold maps in Fig. S3 for both adults and children in the category localizer, copied here as well. For the geometry task, most of the clusters identified are fairly big and survive cluster-corrected permutations; the uncorrected statistical maps look almost fully identical to the one presented in Fig. 3 (p<.001 map).

      - I'm missing some discussion on the role of early visual areas that goes beyond the RSA-CNN comparison. I would imagine that early visual areas are not only engaged due to top-down feedback (line 258) but may actually also encode some of the geometric features, such as parallel lines and symmetry. Is it feasible to look at early visual areas and examine what the similarity structure between different shapes looks like?

      If early visual areas encoded the geometric features that we propose, then even early sensor-level RSA matrices should show a strong impact of geometric feature similarity, which is not what we find (figure 4D). We do, however, appreciate the referee’s request to examine more closely what this similarity structure looks like. We now provide a movie showing the significant correlation between neural activity and our two models (uncorrected participants); indeed, while the early occipital activity (around 110 ms) is dominated by a significant correlation with the CNN model, there are also scattered significant sources associated with the symbolic model around these timepoints already.

      To test this further, we used beamformers to reconstruct the source-localized activity in calcarine cortex and performed an RSA analysis across that ROI. We find that indeed the CNN model is strongly significant at t=110 ms (t=3.43, df=18, p=.003) while the geometric feature model is not (t=1.04, df=18, p=.31), and the CNN is significantly above the geometric feature model (t=4.25, df=18, p<.001). However, this result is not very stable across time: there are significant temporal clusters around these timepoints for each model, but none for the CNN > geometric contrast (CNN: significant cluster from 88 ms to 140 ms, p<.001, permutation-based with 10000 permutations; geometric features: significant cluster from 80 ms to 104 ms, p=.0475; no significant cluster on the difference between the two).
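
      The group-level logic behind such ROI statistics can be sketched as follows. This is a minimal illustration on synthetic data, not the analysis pipeline used in the study: the per-participant fit measure (a Spearman correlation), the number of shapes, the random placeholder model RDMs, and all variable names are assumptions. Only the sample size of 19 is taken from the reported df=18.

```python
import numpy as np
from scipy.stats import spearmanr, ttest_1samp, ttest_rel

rng = np.random.default_rng(0)
n_participants = 19                        # consistent with df = 18 above
n_shapes = 11                              # assumed number of shapes
n_pairs = n_shapes * (n_shapes - 1) // 2   # upper triangle of an RDM

# Hypothetical model RDMs (vectorized upper triangles); random placeholders.
cnn_model = rng.random(n_pairs)
geometric_model = rng.random(n_pairs)


def model_fit(neural_rdm, model_rdm):
    """Spearman correlation between a neural RDM and a model RDM."""
    rho, _ = spearmanr(neural_rdm, model_rdm)
    return rho


cnn_fits, geo_fits = [], []
for _ in range(n_participants):
    # Synthetic "neural" RDM, driven by the CNN model plus noise by construction.
    neural = cnn_model + rng.normal(scale=0.5, size=n_pairs)
    cnn_fits.append(model_fit(neural, cnn_model))
    geo_fits.append(model_fit(neural, geometric_model))

cnn_fits = np.array(cnn_fits)
geo_fits = np.array(geo_fits)

# Group-level tests across participants (df = n_participants - 1).
t_cnn, p_cnn = ttest_1samp(cnn_fits, 0.0)        # CNN model vs. zero
t_geo, p_geo = ttest_1samp(geo_fits, 0.0)        # geometric model vs. zero
t_diff, p_diff = ttest_rel(cnn_fits, geo_fits)   # CNN vs. geometric (paired)

print(f"CNN:       t={t_cnn:.2f}, p={p_cnn:.3g}")
print(f"Geometric: t={t_geo:.2f}, p={p_geo:.3g}")
print(f"CNN > geo: t={t_diff:.2f}, p={p_diff:.3g}")
```

      Repeating these tests at every timepoint and correcting with a cluster-based permutation procedure would give the temporal clusters mentioned above; that step is omitted here for brevity.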

      (3) MEG

      - Similar to the fMRI set, I am a little worried that task difficulty has an effect on the decoding results, as the oddball should pop out more in more geometric shapes, making it easier to detect and easier to decode. Can the authors comment on whether it would matter for the conclusions whether they are decoding varying task difficulty or differences in geometric regularity, or whether they think this can be considered similarly?

      See above for an extensive discussion of the task difficulty effect. We point out that there is no task in the MEG data collection part. We have clarified the task design by updating our Fig. 4. Additionally, the fact that oddballs are perceived more or less easily as a function of their geometric regularity is, in part, exactly the point that we are making; in MEG, however, this holds even in the absence of a task of looking for them.

      - The authors discuss that the inflated baseline/onset decoding/regression estimates may occur because the shapes are being repeated within a mini-block, which I think is unlikely given the long ISIs and the fact that the geometric features model is not >0 at onset. I think their second possible explanation, that this may have to do with smoothing, is very possible. In the text, it said that for the non-smoothed result, the CNN encoding correlates with the data from 60ms, which makes a lot more sense. I would like to encourage the authors to provide readers with the unsmoothed beta values instead of the 100-ms smoothed version in the main plot to preserve the reason they chose to use MEG - for high temporal resolution!

      We fully agree with the reviewer and have accordingly updated the figures to show the unsmoothed data (see below). Indeed, there is now no significant CNN effect before ~60 ms (up to the accuracy of identifying onsets with our method).
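
      To illustrate why a 100-ms smoothing window can make an effect appear before its true onset, here is a minimal sketch on synthetic data (not the MEG data from the study). The 1-ms sampling step, the step-shaped effect starting at 60 ms, and the centered moving average are all assumptions chosen for illustration.

```python
import numpy as np

# Synthetic time axis: -100 ms to 400 ms in 1-ms steps (assumed sampling).
times = np.arange(-100, 401)

# Hypothetical effect: exactly zero until 60 ms, then constant at 1.
true_onset_ms = 60
signal = (times >= true_onset_ms).astype(float)

# Centered moving average of about 100 ms (101 samples, so it is symmetric).
window = 101
kernel = np.ones(window) / window
smoothed = np.convolve(signal, kernel, mode="same")


def first_nonzero_time(values, eps=1e-12):
    """Return the first time point (in ms) at which the trace exceeds eps."""
    return int(times[np.argmax(values > eps)])


raw_onset = first_nonzero_time(signal)
smoothed_onset = first_nonzero_time(smoothed)
print(f"raw onset: {raw_onset} ms, smoothed onset: {smoothed_onset} ms")
```

      With a centered window, the smoothed trace departs from zero half a window (50 ms) before the true onset, which is why unsmoothed estimates are preferable when onset latency is the quantity of interest.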

      - In Figure 4C, I think it would be useful to either provide error bars or show variability across participants by plotting each participant's beta values. I think it would also be nice to plot the dissimilarity matrices based on the MEG data at select timepoints, just to see what the similarity structure is like.

      Following the reviewer’s recommendation, we plot the timeseries with SEM as shaded area, and thicker lines for statistically significant clusters, and we provide the unsmoothed version in Fig. 4. As for the dissimilarity matrices at select timepoints, these have now been added to Fig. 4.

      - To evaluate the source model reconstruction, I think the reader would need a little more detail on how it was done in the main text. How were the lead fields calculated? Which data was used to estimate the sources? How are the models correlated with the source data?

      We have imported some of the details in the main text as follows (as well as expanding the methods section a little):

      “To understand which brain areas generated these distinct patterns of activations, and probe whether they fit with our previous fMRI results, we performed a source reconstruction of our data. We projected the sensor activity onto each participant's cortical surfaces estimated from T1-images. The projection was performed using eLORETA and empty-room recordings acquired on the same day to estimate noise covariance, with the default parameters of mne-bids-pipeline. Sources were spaced using a recursively subdivided octahedron (oct5). Group statistics were performed after alignment to fsaverage. We then replicated the RSA analysis […]”

      - In addition to fitting the CNN, which is used here to model differences in early visual cortex, have the authors considered looking at their fMRI results and localizing early visual regions, extracting a similarity matrix, and correlating that with the MEG and/or comparing it with the CNN model?

      We had ultimately decided against comparing the empirical similarity matrices from the MEG and fMRI experiments, first because the stimuli and tasks are different, and second because this would not be directly relevant to our goal, which is to evaluate whether a geometric-feature model accounts for the data. Thus, we systematically model empirical similarity matrices from fMRI and from MEG with our two models derived from different theories of shape perception in order to test predictions about their spatial and temporal dynamics. As for comparing the similarity matrix from early visual regions in fMRI with that predicted by the CNN model, this is effectively visible from our Fig. 3D where we perform searchlight RSA analysis and modeling with both the CNN and the geometric feature model; bilaterally, we find a correlation with the CNN model, although it sometimes overlaps with predictions from the geometric feature model as well. We now include a section explaining this reasoning in the appendix:

      “Representational similarity analysis also offers a way to directly compare similarity matrices measured in MEG and fMRI, thus allowing for fusion of those two modalities and tentatively assigning a “time stamp” to distinct MRI clusters. However, we did not attempt such an analysis here for several reasons. First, distinct tasks and block structures were used in MEG and fMRI. Second, a smaller list of shapes was used in fMRI, as imposed by the slower modality of acquisition. Third, our study was designed as an attempt to sort out between two models of geometric shape recognition. We therefore focused all analyses on this goal, which could not have been achieved by direct MEG-fMRI fusion, but required correlation with independently obtained model predictions.”

      Minor comments

      - It's a little unclear from the abstract that there is children's data for fMRI only.

      We have reworded the abstract to make this unambiguous.

      - Figures 4a & b are missing y-labels.

      We can see how our labels could be confused with (sub-)plot titles and have moved them to make the interpretation clearer.

      - MEG: are the stimuli always shown in the same orientation and size?

      They are not, each shape has a random orientation and scaling. On top of a task example at the top of Fig. 4, we have now included a clearer mention of this in the main text when we introduce the task:

      “shapes were presented serially, one at a time, with small random changes in rotation and scaling parameters, in miniblocks with a fixed quadrilateral shape and with rare intruders with the bottom right corner shifted by a fixed amount (Sablé-Meyer et al., 2021)”

      - To me, the discussion section felt a little lengthy, and I wonder whether it would benefit from being a little more streamlined, focused, and targeted. I found that the structure was a little difficult to follow as it went from describing the result by modality (behavior, fMRI, MEG) back to discussing mostly aspects of the fMRI findings.

      We have tried to re-organize and streamline the discussion following these comments.

      Then, later on, I found that especially the section on "neurophysiological implementation of geometry" went beyond the focus of the data presented in the paper and was comparatively long and speculative.

      We have reexamined the discussion, but the citation of papers emphasizing a representation of non-accidental geometric properties in non-human animals was requested by other commentators on our article; and indeed, we think that they are relevant in the context of our prior suggestion that the composition of geometric features might be a uniquely human feature – these papers suggest that individual features may not, and that it is therefore compositionality which might be special to the human brain. We have nevertheless shortened it.

      Furthermore, we think that this section is important because symbolic models are often criticized for lack of a plausible neurophysiological implementation. It is therefore important to discuss whether and how the postulated symbolic geometric code could be realized in neural circuits. We have added this justification to the introduction of this section.

      Reviewer #2 (Recommendations for the authors):

      (1) If the authors want to specifically claim that their findings align with mathematical reasoning, they could at least show the overlap between the activation maps of the current study and those from prior work.

      This was added to the fMRI results. See our answers to the public review.

      (2) I wonder if the reason the authors only found aIPS in their first analysis (Figure 2) is because they are contrasting geometric shapes with figures that also have geometric properties. In other words, faces, objects, and houses also contain geometric shape information, and so the authors may have essentially contrasted out other areas that are sensitive to these features. One indication that this may be the case is that the geometric regularity effect and searchlight RSA (Figure 3) contains both anterior and posterior IPS regions (but crucially, little ventral activity). It might be interesting to discuss the implications of these differences.

      Indeed, we cannot exclude that the few symmetries, perpendicularity and parallelism cues that can be present in faces, objects or houses were processed as such, perhaps within the ventral pathway, and that these representations would have been subtracted out. We emphasize that our subtraction isolates the geometrical features that are present in simple regular geometric shapes, over and above those that might exist in other categories. We have added this point to the discussion:

      “[… ] For instance, faces possess a plane of quasi-symmetry, and so do many other man-made tools and houses. Thus, our subtraction isolated the geometrical features that are present in simple regular geometric shapes (e.g. parallels, right angles, equality of length) over and above those that might already exist, in a less pure form, in other categories.”

      (3) I had a few questions regarding the MEG results.

      a. I didn't quite understand the task. What is a regular or oddball shape in this context? It's not clear what is being decoded. Perhaps a small example of the MEG task in Figure 4 would help?

      We now include an additional sub-figure in Fig. 4 to explain the paradigm. In brief: there is no explicit task, participants are simply asked to fixate. The shapes come in miniblocks of 30 identical reference shapes (up to rotation and scaling), among which some occasional deviant shapes randomly appear (created by moving the corner of the reference shape by some amount).

      b. In Figure 4A/B they describe the correlation with a 'symbolic model'. Is this the same as the geometric model in 4C?

      It is. We have removed this ambiguity by calling it “geometric model” and setting its color to the one associated with this model throughout the article.

      c. The authors' explanation for why geometric feature coding was slower than CNN encoding doesn't quite make sense to me. As an explanation, they suggest that previous studies computed "elementary features of location or motor affordance", whereas their study examines "high-level mathematical information of an abstract nature." However, looking at the studies the authors cite in this section, it seems that these studies also examined the time course of shape processing in the dorsal pathway, not "elementary features of location or motor affordance." Second, it's not clear how the geometric feature model reflects high-level mathematical information (see point above about claiming this is related to math).

      We thank the referee for pointing out this inappropriate phrase, which we removed. We rephrased the rest of the paragraph to clarify our hypothesis in the following way:

      “However, in this work, we specifically probed the processing of geometric shapes that, if our hypothesis is correct, are represented as mental expressions that combine geometrical and arithmetic features of an abstract categorical nature, for instance representing “four equal sides” or “four right angles”. It seems logical that such expressions, combining number, angle and length information, take more time to be computed than the first wave of feedforward processing within the occipito-temporal visual pathway, and therefore only activate thereafter.”

      One explanation may be that the authors' geometric shapes require finer-grained discrimination than the object categories used in prior studies. i.e., the odd-ball task may be more of a fine-grained visual discrimination task. Indeed, it may not be a surprise that one can decode the difference between, say, a hammer and a butterfly faster than two kinds of quadrilaterals.

      We do not disagree with this intuition, although note that we do not have data on this point (we are reporting and modelling the MEG RSA matrix across geometric shapes only – in this part, no other shapes such as tools or faces are involved). Still, the difference between squares, rectangles, parallelograms and other geometric shapes in our stimuli is not so subtle. Furthermore, CNNs do make very fine grained distinctions, for instance between many different breeds of dogs in the IMAGENET corpus. Still, those sorts of distinctions capture the initial part of the MEG response, while the geometric model is needed only for the later part. Thus, we think that it is a genuine finding that geometric computations associated with the dorsal parietal pathway are slower than the image analysis performed by the ventral occipito-temporal pathway.

      d. CNN encoding at time 0 is a little weird, but the authors' explanation, that this is due to temporal smoothing with a 100 ms window, makes sense. However, smoothing by 100 ms is quite a lot, and it doesn't seem accurate to present continuous time course data when the decoding or RSA result at each time point reflects a 100 ms bin. It may be more accurate to simply show unsmoothed data. I'm less convinced by the explanation about shape prediction.

      We agree. Following the reviewer’s advice, as well as the recommendation from reviewer 1, we now display unsmoothed plots, and the effects now exhibit a more reasonable timing (Figure 4D), with effects starting around ~60 ms for CNN encoding.

      (4) I appreciate the authors' use of multiple models and their explanation for why DINOv2 explains more variance than the geometric and CNN models (that it represents both types of features). A variance partitioning analysis may help strengthen this conclusion (Bonner & Epstein, 2018; Lescroart et al., 2015).

      However, one difference between DINOv2 and the CNN used here is that it is trained on a dataset of 142 million images vs. the 1.5 million images used in ImageNet. Thus, DINOv2 is more likely to have been exposed to simple geometric shapes during training, whereas standard ImageNet trained models are not. Indeed, prior work has shown that lesioning line drawing-like images from such datasets drastically impairs the performance of large models (Mayilvahanan et al., 2024). Thus, it is unlikely that the use of a transformer architecture explains the performance of DINOv2. The authors could include an ImageNet-trained transformer (e.g., ViT) and a CNN trained on large datasets (e.g., ResNet trained on the Open Clip dataset) to test these possibilities. However, I think it's also sufficient to discuss visual experience as a possible explanation for the CNN and DINOv2 results. Indeed, young children are exposed to geometric shapes, whereas ImageNet-trained CNNs are not.

      We agree with the reviewer’s observation. In fact, new and ongoing work from the lab is also exploring this; we have included in supplementary materials exactly what the reviewer is suggesting, namely the time course of the correlation with ViT and with ConvNeXT. In line with the reviewer’s prediction, these networks, trained on a much larger dataset and with many more parameters, can also fit the human data as well as DINOv2. We ran an additional analysis of the MEG data with ViT and ConvNeXT, which we now report in Fig. S6 as well as in an additional sentence in that section:

      “[…] similar results were obtained by performing the same analysis, not only with another vision transformer network, ViT, but crucially using a much larger convolutional neural network, ConvNeXT, which comprises ~800M parameters and has been trained on 2B images, likely including many geometric shapes and human drawings. For the sake of completeness, RSA analysis in sensor space of the MEG data with these two models is provided in Fig. S6.”

      We conclude that the size and nature of the training set could be as important as the architecture – but also note that humans do not rely on such a huge training set. We have updated the text, as well as Fig. S6, accordingly by updating the section now entitled “Vision Transformers and Larger Neural Networks”, and the discussion section on theoretical models.
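
      For readers unfamiliar with the variance partitioning analysis suggested in point (4), a minimal sketch is given below. It is not an analysis reported in the article: the data are synthetic, the two predictors merely stand in for vectorized model RDMs, and the function names are hypothetical. The decomposition itself is the standard one for two predictors, based on the R² of each model alone and of both together.

```python
import numpy as np


def r_squared(predictors, target):
    """R^2 of an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(target)), predictors])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    residuals = target - X @ beta
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def partition_variance(model_a, model_b, target):
    """Split explained variance into unique and shared parts for two models."""
    r2_a = r_squared(model_a[:, None], target)
    r2_b = r_squared(model_b[:, None], target)
    r2_full = r_squared(np.column_stack([model_a, model_b]), target)
    return {
        "unique_a": r2_full - r2_b,        # explained only by model A
        "unique_b": r2_full - r2_a,        # explained only by model B
        "shared": r2_a + r2_b - r2_full,   # explained by either model
        "full": r2_full,
    }


# Synthetic example: the target depends on both (hypothetical) models.
rng = np.random.default_rng(1)
n_pairs = 55
model_a = rng.random(n_pairs)   # stand-in for one model's dissimilarities
model_b = rng.random(n_pairs)   # stand-in for the other model's dissimilarities
target = model_a + model_b + rng.normal(scale=0.3, size=n_pairs)

parts = partition_variance(model_a, model_b, target)
for name, value in parts.items():
    print(f"{name}: {value:.3f}")
```

      A large shared component for a third model relative to these two would support the interpretation that it captures both types of features; the shared term can be slightly negative when the predictors act as suppressors of each other.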

      (5) The authors may be interested in a recent paper from Arcaro and colleagues that showed that the parietal cortex is greatly expanded in humans (including infants) compared to non-human primates (Meyer et al., 2025), which may explain the stronger geometric reasoning abilities of humans.

      A very interesting article indeed! We have updated our article to incorporate this reference in the discussion, in the section on visual pathways, as follows:

      “Finally, recent work shows that within the visual cortex, the strongest relative difference in growth between human and non-human primates is localized in parietal areas (Meyer et al., 2025). If this expansion reflected the acquisition of new processing abilities in these regions, it might explain the observed differences in geometric abilities between human and non-human primates (Sablé-Meyer et al., 2021).”

      Also, the authors may want to include this paper, which uses a similar oddity task and compellingly shows that crows are sensitive to geometric regularity:

      Schmidbauer, P., Hahn, M., & Nieder, A. (2025). Crows recognize geometric regularity. Science Advances, 11(15), eadt3718. https://doi.org/10.1126/sciadv.adt3718

      We have ongoing discussions with the authors of this work and have prepared a response to their findings (Sablé-Meyer and Dehaene, 2025); ultimately, we think that this discussion, which we agree is important, does not have its place in the present article. They used a reduced version of our design, with amplified differences in the intruders. While they did not test the fit of their data with CNN or geometric feature models, we did and found that a simple CNN suffices to account for crow behavior. Thus, we disagree that their conclusions follow from their results. But the present article does not seem to be the right platform to engage in this discussion.

      References

      Ayzenberg, V., & Behrmann, M. (2022). The Dorsal Visual Pathway Represents Object-Centered Spatial Relations for Object Recognition. The Journal of Neuroscience, 42(23), 4693-4710. https://doi.org/10.1523/jneurosci.2257-21.2022

      Bonner, M. F., & Epstein, R. A. (2018). Computational mechanisms underlying cortical responses to the affordance properties of visual scenes. PLoS Computational Biology, 14(4), e1006111. https://doi.org/10.1371/journal.pcbi.1006111

      Bueti, D., & Walsh, V. (2009). The parietal cortex and the representation of time, space, number and other magnitudes. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1525), 1831-1840.

      Dehaene, S., & Brannon, E. (2011). Space, time and number in the brain: Searching for the foundations of mathematical thought. Academic Press.

      Freud, E., Culham, J. C., Plaut, D. C., & Behrmann, M. (2017). The large-scale organization of shape processing in the ventral and dorsal pathways. eLife, 6, e27576.

      Freud, E., Ganel, T., Shelef, I., Hammer, M. D., Avidan, G., & Behrmann, M. (2017). Three-dimensional representations of objects in dorsal cortex are dissociable from those in ventral cortex. Cerebral Cortex, 27(1), 422-434.

      Freud, E., Plaut, D. C., & Behrmann, M. (2016). 'What' is happening in the dorsal visual pathway. Trends in Cognitive Sciences, 20(10), 773-784.

      Freud, E., Plaut, D. C., & Behrmann, M. (2019). Protracted developmental trajectory of shape processing along the two visual pathways. Journal of Cognitive Neuroscience, 31(10), 1589-1597.

      Han, Z., & Sereno, A. (2022). Modeling the Ventral and Dorsal Cortical Visual Pathways Using Artificial Neural Networks. Neural Computation, 34(1), 138-171. https://doi.org/10.1162/neco_a_01456

      Janssen, P., Srivastava, S., Ombelet, S., & Orban, G. A. (2008). Coding of shape and position in macaque lateral intraparietal area. Journal of Neuroscience, 28(26), 6679-6690.

      Konen, C. S., & Kastner, S. (2008). Two hierarchically organized neural systems for object information in human visual cortex. Nature Neuroscience, 11(2), 224-231.

      Lescroart, M. D., Stansbury, D. E., & Gallant, J. L. (2015). Fourier power, subjective distance, and object categories all provide plausible models of BOLD responses in scene-selective visual areas. Frontiers in Computational Neuroscience, 9(135), 1-20. https://doi.org/10.3389/fncom.2015.00135

      Mayilvahanan, P., Zimmermann, R. S., Wiedemer, T., Rusak, E., Juhos, A., Bethge, M., & Brendel, W. (2024). In search of forgotten domain generalization. arXiv Preprint arXiv:2410.08258.

      Meyer, E. E., Martynek, M., Kastner, S., Livingstone, M. S., & Arcaro, M. J. (2025). Expansion of a conserved architecture drives the evolution of the primate visual cortex. Proceedings of the National Academy of Sciences, 122(3), e2421585122. https://doi.org/10.1073/pnas.2421585122

      Orban, G. A. (2011). The extraction of 3D shape in the visual system of human and nonhuman primates. Annual Review of Neuroscience, 34, 361-388.

      Romei, V., Driver, J., Schyns, P. G., & Thut, G. (2011). Rhythmic TMS over Parietal Cortex Links Distinct Brain Frequencies to Global versus Local Visual Processing. Current Biology, 21(4), 334-337. https://doi.org/10.1016/j.cub.2011.01.035

      Sereno, A. B., & Maunsell, J. H. R. (1998). Shape selectivity in primate lateral intraparietal cortex. Nature, 395(6701), 500-503. https://doi.org/10.1038/26752

      Summerfield, C., Luyckx, F., & Sheahan, H. (2020). Structure learning and the posterior parietal cortex. Progress in Neurobiology, 184, 101717. https://doi.org/10.1016/j.pneurobio.2019.101717

      Van Dromme, I. C., Premereur, E., Verhoef, B.-E., Vanduffel, W., & Janssen, P. (2016). Posterior Parietal Cortex Drives Inferotemporal Activations During Three-Dimensional Object Vision. PLoS Biology, 14(4), e1002445. https://doi.org/10.1371/journal.pbio.1002445

      Xu, Y. (2018). A tale of two visual systems: Invariant and adaptive visual information representations in the primate brain. Annual Review of Vision Science, 4, 311-336.

      Reviewer #3 (Recommendations for the authors):

      Bring into the discussion some of the issues outlined above, especially a) the spatial rather than visual nature of the geometric figures and b) the non-representational aspects of geometric form.

      We thank the reviewer for their recommendations – see our response to the public review for more details.

    1. eLife Assessment

      The authors present valuable empirical and modelling evidence that statistical learning in speech perception may contain sub-processes. While the evidence for statistical learning effects is solid, the link between the pattern of effects (both empirical and simulated) and the theoretical concepts of the sub-processes (e.g., segmentation, anticipation) could be further developed. This work is of broad interest to researchers working on, or with, statistical learning, and to any researcher interested in the challenges of how data and models adjudicate between competing theoretical constructs.

    2. Reviewer #1 (Public review):

      Summary:

      This paper presents three experiments. Experiments 1 and 3 use a target detection paradigm to investigate the speed of statistical learning. The first experiment is a replication of Batterink, 2017, in which participants are presented with streams of uniform-length, trisyllabic nonsense words and asked to detect a target syllable. The results replicate previous findings, showing that learning (in the form of response time facilitation to later-occurring syllables within a nonsense word) occurs after a single exposure to a word. In the second experiment, participants are presented with streams of variable length nonsense words (two trisyllabic words and two disyllabic words), and perform the same task. A similar facilitation effect was observed as in Experiment 1. In Experiment 3 (newly added in the Revised manuscript), an adult version of the study by Johnson and Tyler is included. Participants were exposed to streams of words of either uniform length (all disyllabic) or mixed length (two disyllabic, two trisyllabic) and then asked to perform a familiarity judgment on a 1-5 scale on two words from the stream and two part-words. Performance was better in the uniform length condition.

      The authors interpret these findings as evidence that target detection requires mechanisms different from segmentation. They present results of a computational model to simulate results from the target detection task, and find that a bigram model can produce facilitation effects similar to the ones observed by human participants in Experiments 1 and 2 (though this model was not directly applied to test whether human-like effects were also produced to account for the data in Experiment 3). PARSER was also tested and produced differing results from those observed by humans across all three experiments. The authors conclude that the mechanisms involved in the target detection task are different from those involved in the word segmentation task.
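
      As a rough illustration of what a bigram account of target detection involves, the sketch below estimates forward transitional probabilities from a continuous syllable stream. It is a generic bigram estimator under stated assumptions, not the authors' model: the four nonsense words, the stream length, and the function name are hypothetical, and the link from predictability to response time is not modelled.

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical trisyllabic nonsense words (not the stimuli used in the study).
words = [("ba", "di", "ku"), ("po", "ra", "ti"),
         ("ge", "lo", "mu"), ("fi", "na", "so")]

# Build a continuous syllable stream with no immediate word repetition.
stream = []
previous = None
for _ in range(200):
    word = random.choice([w for w in words if w != previous])
    stream.extend(word)
    previous = word

# Count bigrams, and how often each syllable occurs as the first element.
bigram_counts = Counter(zip(stream, stream[1:]))
first_counts = Counter(stream[:-1])


def transitional_probability(a, b):
    """Estimate P(b | a) from bigram counts in the stream."""
    return bigram_counts[(a, b)] / first_counts[a]


# Word-internal transition: "ba" is always followed by "di".
within = transitional_probability("ba", "di")

# Transitions across a word boundary: "ku" can be followed by three onsets.
across_tps = [transitional_probability("ku", w[0]) for w in words if w[0] != "ba"]

print(f"within-word TP: {within:.2f}")
print("across-boundary TPs:", [round(tp, 2) for tp in across_tps])
```

      In such a stream, second and third syllables are fully predictable from the preceding syllable whereas word-initial syllables are not, which is the kind of asymmetry a bigram model could exploit to produce faster detection of later-occurring syllables.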

      Strengths:

      The paper presents multiple experiments that provide internal replication of a key experimental finding, in which response times are facilitated after a single exposure to an embedded pseudoword. Both experimental data and results from a computational model are presented, providing converging approaches for understanding and interpreting the main results. The data are analyzed very thoroughly using mixed effects models with multiple explanatory factors. The addition of Experiment 3 provides direct evidence that the profile of performance for familiarity ratings and target detection differ as a function of word length variability.

      Weaknesses:

      (1) The concept of segmentation is still not quite clear. The authors seem to treat the testing procedure of Experiment 3 as synonymous with segmentation. But the ability to more strongly endorse words from the stream versus part-words as familiar does not necessarily mean that they have been successfully "segmented", as I elaborated on in my earlier review. In my view, it would be clearer to refer to segmentation as the mechanism or conceptual construct of segmenting continuous speech into discrete words. This ability to accurately segment component words could support familiarity judgments but is not necessary for above-chance familiarity or recognition judgments, which could be supported by more general memory signals. In other words, segmentation as an underlying ability is sufficient but not necessary for above-chance performance on familiarity-driven measures such as the one used in experiment 3.

      (2) The addition of Experiment 3 is an added strength of the revised paper and provides more direct evidence of dissociations as a function of word length on the two tasks (target detection and familiarity ratings), compared to the prior strategy of just relying on previous work for this claim. However, it is not clear why the authors chose not to use the same stimuli as used in Experiments 1 and 2, which would have allowed for more direct comparisons to be made. It should also be specified whether test items in the UWL and MWL conditions were matched for overall frequency during exposure. Currently, the text does not specify whether test words in the UWL condition were taken from the high frequency or low frequency group; if they were taken from the high frequency group this would of course be a confound when comparing to the MWL condition. Finally, the definition of part-words should also be clarified.

      (3) The framing and argument for a prediction/anticipation mechanism was dropped in the revised manuscript, but there are still a few instances where this framing and interpretation remain. E.g. Abstract - "we found that a prediction mechanism, rather than clustering, could explain the data from target detection." Discussion page 43 "Together, these results suggest that a simple prediction-based mechanism can explain the results from the target detection task, and clustering-based approaches such as PARSER cannot, contrary to previous claims."

      Minor (4) It was a bit unclear as to why a conceptual replication of Batterink 2017 was conducted, given that the target syllables at the beginning and end of the streams were immediately dropped from further analysis. Why include syllable targets within these positions in the design if they are not analyzed?

      (5) Figures 3 and 4 are plotted on different scales, which makes it difficult to visually compare the effects between word length conditions.

    3. Reviewer #2 (Public review):

      Summary:

      The valuable study investigates how statistical learning may facilitate a target detection task and whether the facilitation effect is related to statistical learning of word boundaries. Solid evidence is provided that target detection and word segmentation rely on different statistical learning mechanisms.

      Strengths:

      The study is well designed, using the contrast between the learning of words of uniform length and words of variable length to dissociate general statistical learning effects and effects related to word segmentation.

      Weaknesses:

      The study relies on the contrast between word length effects on target detection and word learning. However, the study only tested the target detection condition and did not attempt to replicate the word segmentation effect. It is true that the word segmentation effect has been replicated before but it is still worth reviewing the effect size of previous studies.

      The paper seems to distinguish prediction, anticipation, and statistical learning, but it is not entirely clear what each term refers to.

      Comments on revisions:

      The authors did not address my concerns...they only replied to reviewer 1.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      This paper presents two experiments, both of which use a target detection paradigm to investigate the speed of statistical learning. The first experiment is a replication of Batterink, 2017, in which participants are presented with streams of uniform-length, trisyllabic nonsense words and asked to detect a target syllable. The results replicate previous findings, showing that learning (in the form of response time facilitation to later-occurring syllables within a nonsense word) occurs after a single exposure to a word. In the second experiment, participants are presented with streams of variable-length nonsense words (two trisyllabic words and two disyllabic words) and perform the same task. A similar facilitation effect was observed as in Experiment 1. The authors interpret these findings as evidence that target detection requires mechanisms different from segmentation. They present results of a computational model to simulate results from the target detection task and find that an "anticipation mechanism" can produce facilitation effects, without performing segmentation. The authors conclude that the mechanisms involved in the target detection task are different from those involved in the word segmentation task.

      Strengths:

      The paper presents multiple experiments that provide internal replication of a key experimental finding, in which response times are facilitated after a single exposure to an embedded pseudoword. Both experimental data and results from a computational model are presented, providing converging approaches for understanding and interpreting the main results. The data are analyzed very thoroughly using mixed effects models with multiple explanatory factors.

      Weaknesses:

      In my view, the main weaknesses of this study relate to the theoretical interpretation of the results.

      (1) The key conclusion from these findings is that the facilitation effect observed in the target detection paradigm is driven by a different mechanism (or mechanisms) than those involved in word segmentation. The argument here I think is somewhat unclear and weak, for several reasons:

      First, there appears to be some blurring in what exactly is meant by the term "segmentation" with some confusion between segmentation as a concept and segmentation as a paradigm.

      Conceptually, segmentation refers to the segmenting of continuous speech into words. However, this conceptual understanding of segmentation (as a theoretical mechanism) is not necessarily what is directly measured by "traditional" studies of statistical learning, which typically (at least in adults) involve exposure to a continuous speech stream followed by a forced-choice recognition task of words versus recombined foil items (part-words or nonwords). To take the example provided by the authors, a participant presented with the sequence GHIABCDEFABCGHI may endorse ABC as being more familiar than BCG, because ABC is presented more frequently together and the learned association between A and B is stronger than between C and G. However, endorsement of ABC over BCG does not necessarily mean that the participant has "segmented" ABC from the speech stream, just as faster reaction times in responding to syllable C versus A do not necessarily indicate successful segmentation. As the authors argue on page 7, "an encounter to a sequence in which two elements co-occur (say, AB) would theoretically allow the learner to use the predictive relationship during a subsequent encounter (that A predicts B)." By the same logic, encoding the relationship between A and B could also allow for the above-chance endorsement of items that contain AB over items containing a weaker relationship.

      Both recognition performance and facilitation through target detection reflect different outcomes of statistical learning. While they may reflect different aspects of the learning process and/or dissociable forms of memory, they may best be viewed as measures of statistical learning, rather than mechanisms in and of themselves.

      Thanks for this nuanced discussion, and this is an important point that R2 also raised. We agree that segmentation can refer to both an experimental paradigm and a mechanism that accounts for learning in the experimental paradigm. In the experimental paradigm, participants are asked to identify which words they believe to be (whole) words from the continuous syllable stream. In the target-detection experimental paradigm, participants are not asked to identify words from continuous streams, and instead, they respond to the occurrences of a certain syllable. It’s possible that learners employ one mechanism in these two tasks, or that they employ separate mechanisms. It’s also the case that, if all we have is positive evidence for both experimental paradigms, i.e., learners can succeed in segmentation tasks as well as in target detection tasks with different types of sequences, we would have no way of talking about different mechanisms, since, as you correctly suggested, evidence for segmenting AB and for processing B faster following A is not evidence for different mechanisms.

      However, that is not the case. When the syllable sequences contain same-length subsequences (i.e., words), learning is indeed successful in both segmentation and target detection tasks. However, studies such as Hoch et al. (2013) suggest that words from mixed-length sequences are harder to segment than words from uniform-length sequences. This finding exists in adult work (e.g., Hoch et al. 2013) as well as infant work (Johnson & Tyler, 2010), and is replicated here in the newly included Experiment 3. It stands in contrast to the positive findings of the facilitation effect with mixed-length sequences in the target detection paradigm (one of our main findings in the paper). Thus, if the learning mechanisms were the same, it would be difficult to explain why humans can succeed with mixed-length sequences in target detection (as shown in Experiment 2) but have difficulty with mixed-length sequences in segmentation (as shown in Hoch et al. and Experiment 3).

      In our paper, we have clarified these points and describe the separate mechanisms in more detail, in both the Introduction and General Discussion sections.

      (2) The key manipulation between experiments 1 and 2 is the length of the words in the syllable sequences, with words either constant in length (experiment 1) or mixed in length (experiment 2). The authors show that similar facilitation levels are observed across this manipulation in the current experiments. By contrast, they argue that previous findings have found that performance is impaired for mixed-length conditions compared to fixed-length conditions. Thus, a central aspect of the theoretical interpretation of the results rests on prior evidence suggesting that statistical learning is impaired in mixed-length conditions. However, it is not clear how strong this prior evidence is. There is only one published paper cited by the authors - the paper by Hoch and colleagues - that supports this conclusion in adults (other mentioned studies are all in infants, which use very different measures of learning). Other papers not cited by the authors do suggest that statistical learning can occur to stimuli of mixed lengths (Thiessen et al., 2005, using infant-directed speech; Frank et al., 2010 in adults). I think this theoretical argument would be much stronger if the dissociation between recognition and facilitation through RTs as a function of word length variability was demonstrated within the same experiment and ideally within the same group of participants.

      To summarize the evidence of learning uniform-length and mixed-length sequences (which we discussed in the Introduction section), “even though infants and adults alike have shown success segmenting syllable sequences consisting of words that were uniform in length (i.e., all words were either disyllabic; Graf Estes et al., 2007; or trisyllabic, Aslin et al., 1998), both infants and adults have shown difficulty with syllable sequences consisting of words of mixed length (Johnson & Tyler, 2010; Johnson & Jusczyk, 2003a; 2003b; Hoch et al., 2013).” The newly added Experiment 3 also provided evidence for the difference in uniform-length and mixed-length sequences. Notably, we do not agree with the idea that infant work should be disregarded as evidence just because infants were tested with habituation methods; not only were the original findings (Saffran et al. 1996) based on infant work, so were many other studies on statistical learning.

      There are other segmentation studies in the literature that have used mixed-length sequences, which are worth discussing. In short, these studies differ from the Saffran et al. (1996) studies in many important ways, and in our view, these differences explain why the learning was successful. Of interest, Thiessen et al. (2005), which you mentioned, was based on infant work with infant methods, and demonstrated the very point we argued for: In their study, infants failed to learn when mixed-length sequences were pronounced as adult-directed speech, and succeeded in learning given infant-directed speech, which contained prosodic cues that were much more pronounced. The fact that infants failed to segment mixed-length sequences without certain prosodic cues is consistent with our claim that mixed-length sequences are difficult to segment in a segmentation paradigm. Another such study is Frank et al. (2010), where continuous sequences were presented in “sentences”. Different numbers of words were concatenated into sentences where a 500ms break was present between each sentence in the training sequence. One sentence contained only one word, or two words, and in the longest sentence, there were 24 words. The results showed that participants are sensitive to the effect of sentence boundaries, which coincide with word boundaries. In the extreme, the one-word-per-sentence condition simply presents learners with segmented word forms. In the 24-word-per-sentence condition, there are nevertheless sentence boundaries that are word boundaries, and knowing these word boundaries alone should allow learners to perform above chance in the test phase. Thus, in our view, this demonstrates that learners can use sentence boundaries to infer word boundaries, which is an interesting finding in its own right, but this does not show that a continuous syllable sequence with mixed word lengths is learnable without additional information.

      In summary, to our knowledge, syllable sequences containing mixed word lengths are better learned when additional cues to word boundaries are present, and there is strong evidence that syllable sequences containing uniform-word lengths are learned better than mixed-length ones.

      Frank, M. C., Goldwater, S., Griffiths, T. L., & Tenenbaum, J. B. (2010). Modeling human performance in statistical word segmentation. Cognition, 117(2), 107-125.

      To address your proposal of running more experiments to provide stronger evidence for our theory, we were planning to run another study to have the same group of participants do both the segmentation and target detection paradigms as suggested, but we were unable to do so as we encountered difficulties recruiting English-speaking participants. Instead, we have included a previously unpublished experiment (now Experiment 3), which uses the segmentation paradigm to show the difference between the learning of uniform-length and mixed-length sequences. This experiment provides further evidence for adults’ difficulties in segmenting mixed-length sequences.

      (3) The authors argue for an "anticipation" mechanism in explaining the facilitation effect observed in the experiments. The term anticipation would generally be understood to imply some kind of active prediction process, related to generating the representation of an upcoming stimulus prior to its occurrence. However, the computational model proposed by the authors (page 24) does not encode anything related to anticipation per se. While it demonstrates facilitation based on prior occurrences of a stimulus, that facilitation does not necessarily depend on active anticipation of the stimulus. It is not clear that it is necessary to invoke the concept of anticipation to explain the results, or indeed that there is any evidence in the current study for anticipation, as opposed to just general facilitation due to associative learning.

      Thanks for raising this point. Indeed, the anticipation effect we described cannot be distinguished from the facilitation effect observed in the reported experiments. We have dropped this framing.

      In addition, related to the model, given that only bigrams are stored in the model, could the authors clarify how the model is able to account for the additional facilitation at the 3rd position of a trigram compared to the 2nd position?

      Thanks for the question. We believe it is an empirical question whether there is an additional facilitation at the 3rd position of a trigram compared to the 2nd position. To investigate this issue, we conducted the following analysis with data from Experiment 1. First, we combined the data from two conditions (exact/conceptual) from Experiment 1 so as to have better statistical power. Next, we ran a mixed effect regression with data from syllable positions 2 and 3 only (i.e., data from syllable position 1 were not included). The fixed effects included the two-way interaction between syllable position and presentation, as well as stream position, and the random effects were a by-subject random intercept and a random slope for stream position. This interaction was significant (χ²(3) = 11.73, p = 0.008), suggesting that there is additional facilitation to the 3rd position compared to the 2nd position.

      For the model, here is an explanation of why the model assumes an additional facilitation to the 3rd position. In our model, we proposed a simple recursive relation between the RT of a syllable occurring for the nth time and the (n+1)th time, which is:

      and

      RT(1) = RT0 + stream_pos * stream_inc, where the n in RT(n) represents the RT for the nth presentation of the target syllable, stream_pos is the position (3-46) in the stream, and occurrence is the number of times that the syllable has occurred so far in the stream.

      What this means is that the model basically provides an RT value for every syllable in the stream. Thus, for a target at syllable position 1, there is an RT value as an unpredictable target, and for targets at syllable position 2, there is a facilitation effect. For targets at syllable position 3, the RT is facilitated by the same amount. As such, there is an additional facilitation effect for syllable position 3 because effects of prediction are recursive.
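
      One way to picture the recursive argument is the toy sketch below. It is not the authors' model: the recursive equation is not preserved in this text, so the chaining rule, the function names, and every number here are assumptions made purely for illustration. Only the baseline RT(1) = RT0 + stream_pos * stream_inc is taken from the response.

```python
def first_presentation_rt(rt0, stream_pos, stream_inc):
    """Baseline stated in the response: RT(1) = RT0 + stream_pos * stream_inc."""
    return rt0 + stream_pos * stream_inc


def within_word_rts(base_rt, facilitation, n_positions=3):
    """Hypothetical chaining rule (an assumption, not the authors' equation).

    Each stored bigram lowers the RT predicted for the next syllable by the
    same amount, starting from the RT predicted for the preceding syllable.
    """
    rts = [base_rt]
    for _ in range(n_positions - 1):
        rts.append(rts[-1] - facilitation)
    return rts


# All parameter values are made up and are in arbitrary units.
base = first_presentation_rt(rt0=6.0, stream_pos=10, stream_inc=0.01)
rts = within_word_rts(base, facilitation=0.05)
for position, rt in enumerate(rts, start=1):
    print(f"syllable position {position}: RT = {rt:.2f}")
```

      Under this assumed rule, only pairwise (bigram) information is used, yet position 3 ends up faster than position 2 because its facilitation is applied on top of an already facilitated value.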

      (4) In the discussion of transitional probabilities (page 31), the authors suggest that "a single exposure does provide information about the transitions within the single exposure, and the probability of B given A can indeed be calculated from a single occurrence of AB." Although this may be technically true in that a calculation for a single exposure is possible from this formula, it is not consistent with the conceptual framework for calculating transitional probabilities, as first introduced by Saffran and colleagues. For example, Saffran et al. (1996, Science) describe that "over a corpus of speech there are measurable statistical regularities that distinguish recurring sound sequences that comprise words from the more accidental sound sequences that occur across word boundaries. Within a language, the transitional probability from one sound to the next will generally be highest when the two sounds follow one another within a word, whereas transitional probabilities spanning a word boundary will be relatively low." This makes it clear that the computation of transitional probabilities (i.e., Y | X) is conceptualized to reflect the frequency of XY / frequency of X, over a given language inventory, not just a single pair. Phrased another way, a single exposure to pair AB would not provide a reliable estimate of the raw frequencies with which A and AB occur across a given sample of language.

      Thanks for the discussion. We understand your argument, but we respectfully disagree that computing transitional probabilities must be conducted under a certain theoretical framework. In our humble opinion, computing transitional probabilities is a mathematical operation, and as such, it is possible to do so with the least amount of data possible that enables the mathematical operation, which concretely is a single exposure during learning. While it is true that a single exposure may not provide a reliable estimate of frequencies or probabilities, it does provide information with which the learner can make decisions.

      This is particularly true for discussions of the minimal amount of exposure that can enable learning. It is important to distinguish two questions: whether learners can learn from a short exposure period (from a single exposure, in fact), and how long an exposure period the learner requires to produce a reliable estimate of frequencies. Incidentally, given that learners can learn from a single exposure, as shown by Batterink (2017) and the current study, it does not appear that learners require a long exposure period to learn about transitional probabilities.
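
      Both sides of this exchange can be made concrete with a short sketch. It estimates transitional probabilities as count(XY) / count(X) over a syllable stream, using the GHIABCDEFABCGHI sequence from point (1), and then over a single exposure to AB. The function name is hypothetical, and counting X only where it has a successor is a simplifying assumption.

```python
from collections import Counter


def transitional_probabilities(stream):
    """Estimate P(y | x) as count(xy) / count(x) for adjacent elements.

    x is counted only where it has a successor, so the probabilities
    leaving each element sum to 1 (a simplifying assumption).
    """
    pair_counts = Counter(zip(stream, stream[1:]))
    first_counts = Counter(stream[:-1])
    return {(x, y): c / first_counts[x] for (x, y), c in pair_counts.items()}


# Example stream from the review: words GHI, ABC, DEF concatenated.
tp = transitional_probabilities(list("GHIABCDEFABCGHI"))
print("P(B | A) =", tp[("A", "B")])  # within-word transition
print("P(G | C) =", tp[("C", "G")])  # transition spanning a word boundary

# A single exposure to AB already yields a defined estimate,
# although it rests on one observation only.
single = transitional_probabilities(list("AB"))
print("P(B | A) after one exposure =", single[("A", "B")])
```

      The operation is well defined after one exposure, which is the authors' point; the estimate is based on a single count, which is the reviewer's point about reliability.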

      (5) In experiment 2, the authors argue that there is robust facilitation for trisyllabic and disyllabic words alike. I am not sure about the strength of the evidence for this claim, as it appears that there are some conflicting results relevant to this conclusion. Notably, in the regression model for disyllabic words, the omnibus interaction between word presentation and syllable position did not reach significance (p= 0.089). At face value, this result indicates that there was no significant facilitation for disyllabic words. The additional pairwise comparisons are thus not justified given the lack of omnibus interaction. The finding that there is no significant interaction between word presentation, word position, and word length is taken to support the idea that there is no difference between the two types of words, but could also be due to a lack of power, especially given the p-value (p = 0.010).

      Thanks for the comment. Firstly, we believe there is a typo in your comment, where in the last sentence, we believe you were referring to the p-value of 0.103 (source: “The interaction was not significant (χ²(3) = 6.19, p = 0.103)”). Yes, a null result with a frequentist approach cannot support a null claim, but Bayesian analyses could potentially provide evidence for the null.
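
      As a consistency check on the quoted statistic, the p-value for a chi-square value with 3 degrees of freedom can be computed from the closed form of the upper-tail probability. This is a minimal sketch using only the Python standard library; the function name is hypothetical and the formula applies to 3 degrees of freedom only.

```python
import math


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom.

    Closed form: erfc(sqrt(x / 2)) + sqrt(2x / pi) * exp(-x / 2).
    """
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)


# Statistic quoted in the response for the three-way interaction.
p_three_way = chi2_sf_df3(6.19)
print(f"chi2(3) = 6.19 -> p = {p_three_way:.3f}")
```

      The result should agree with the reported p = 0.103 rather than the 0.010 quoted in the review, which supports the authors' reading that the latter is a typo.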

      To this end, we conducted a Bayes factor analysis using the approach outlined in Harms and Lakens (2018), which generates a Bayes factor by computing a Bayesian information criterion for a null model and an alternative model. The alternative model contained a three-way interaction of word length, word presentation, and word position, whereas the null model contained a two-way interaction between word presentation and word position as well as a main effect of word length. Thus, the two models only differ in terms of whether there is a three-way interaction. The Bayes factor is then computed as exp[(BICalt − BICnull)/2]. This analysis showed that there is strong evidence for the null, where the Bayes factor was found to be exp(25.65), which is more than 10¹¹. Thus, there is no power issue here, and there is strong evidence for the null claim that word length did not interact with other factors in Experiment 2.
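
      The arithmetic of the BIC approximation can be sketched as follows. The two BIC values are hypothetical placeholders; only their difference (51.3, i.e. 2 × 25.65) is implied by the response. The function name is likewise an assumption.

```python
import math


def bf01_from_bic(bic_null, bic_alt):
    """Approximate Bayes factor in favour of the null: exp((BIC_alt - BIC_null) / 2)."""
    return math.exp((bic_alt - bic_null) / 2.0)


# Hypothetical BIC values chosen so that the difference is 51.3.
bic_null = 1000.0
bic_alt = 1051.3

bf01 = bf01_from_bic(bic_null, bic_alt)
print(f"BF01 = {bf01:.3e}")
print("exceeds 1e11:", bf01 > 1e11)
```

      A positive BIC difference means the alternative model fits worse once its extra parameters are penalised, so a larger difference gives stronger evidence for the null model.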

      There is another issue that you mentioned: whether we should conduct pairwise comparisons if the omnibus interaction did not reach significance. This would be true given the original analysis plan, but we believe that a revised analysis plan makes more sense. In the revised analysis plan for Experiment 2, we start with the three-way interaction (as described in the previous paragraph). The three-way interaction was not significant, and after dropping the three-way interaction term, the two-way interaction and the main effect of word length are both significant, and we use this as the overall model. Testing the significance of the omnibus interaction between presentation and syllable position, we found that it was significant (χ²(3) = 49.77, p < 0.001). That is, a single model using data from both disyllabic and trisyllabic words shows an interaction between presentation and syllable position. This was in addition to a significant fixed effect of word length (β = 0.018, z = 6.19, p < 0.001). This should motivate the rest of the planned analysis, which concerns pairwise comparisons in the different word length conditions.

      (6) The results plotted in Figure 2 seem to suggest that RTs to the first syllable of a trisyllabic item slow down with additional word presentations, while RTs to the final position speed up. If anything, in this figure, the magnitude of the effect seems to be greater for 1st syllable positions (e.g., the RT difference between presentation 1 and 4 for syllable position 1 seems to be numerically larger than for syllable position 3, Figure 2D). Thus, it was quite surprising to see in the results (p. 16) that RTs for syllable position 1 were not significantly different for presentation 1 vs. the later presentations (but that they were significant for positions 2 and 3 given the same comparison). Is this possibly a power issue? Would there be a significant slowdown to 1st syllables if results from both the exact replication and conceptual replication conditions were combined in the same analysis?

      Thanks for the suggestion and your careful visual inspection of the data. After combining the data, the slowdown to 1st syllables is indeed significant. We have reported this in the results of Experiment 1 (with an acknowledgement to this review):

      Results showed that later presentations took significantly longer to respond to compared to the first presentation (χ²(3) = 10.70, p = 0.014), where the effect grew larger with each presentation (second presentation: β = 0.011, z = 1.82, p = 0.069; third presentation: β = 0.019, z = 2.40, p = 0.016; fourth presentation: β = 0.034, z = 3.23, p = 0.001).

      (7) It is difficult to evaluate the description of the PARSER simulation on page 36. Perhaps this simulation should be introduced earlier in the methods and results rather than in the discussion only.

      Thanks for the suggestions. We have added two separate simulations in the paper, which should describe the PARSER simulations sufficiently, as well as provide further information on the correspondence between the simulations and the experiments. Thanks again for the great review! We believe our paper has improved significantly as a result.

    1. eLife Assessment

      This study presents an important finding that ant nest structure and digging behavior depend on ant age demographics for a ground-dwelling ant species (Camponotus fellah). By asking whether ants employ age-polyethism in excavation, the authors address a long-standing question about how individuals in collectives determine the overall state of the task they must perform. The experimental evidence that the age of the ants and the group composition affect the digging of tunnels is convincing, and their model is able to replicate the colony's excavation dynamics qualitatively, results that may prove to be a key consideration for interpreting results from other studies in the field of social insect behavior.

    2. Reviewer #1 (Public review):

      This study investigates how ant group demographics influence nest structures and group behaviors of Camponotus fellah ants, a ground-dwelling carpenter ant species (found locally in Israel) that build subterranean nest structures. Using a quasi-2D cell filled with artificial sand, the authors perform two complementary sets of experiments to try to link group behavior and nest structure: first, the authors place a mated queen and several pupae into their cell and observe the structures that emerge both before and after the pupae eclose (i.e., "colony maturation" experiments); second, the authors create small groups (of 5, 10, or 15 ants, each including a queen) within a narrow age range (i.e., "fixed demographic" experiments) to explore the dependence of construction on age. Some of the fixed demographic instantiations included a manually induced catastrophic collapse event; the authors then compared emergency repair behavior to natural nest creation. Finally, the authors introduce a modified logistic growth model to describe the time-dependent nest area. The modification introduced parameters that allow for age-dependent behavior, and the authors use their fixed demographic experiments to set these parameters, and then apply the model to interpret the behavior of the colony maturation experiments. The main results of this paper are that for natural nest construction, nest areas and morphologies depend on the age demographics of ants in the experiments: younger ants create larger nests and angled tunnels, while older ants tend to dig less and build predominantly vertical tunnels; in contrast, emergency response seems to elicit digging in ants of all ages to repair the nest.

      The experimental results are convincing, providing new information and important insights into nest and colony growth in a social insect species. A model, inspired by previous work but modified to capture experimental results, is in reasonable agreement with experiments and is more biologically relevant than previous models.

    3. Reviewer #2 (Public review):

      I enjoyed this paper and its examination of the relationship between overall density and age polyethism to reduce the computational complexity required to match nest size with population. I had some questions about the requirement that growth is infinite in such a solution, but these have been addressed by the authors in the responses and updated manuscript. I also enjoyed the discussion of whether collective behaviour is an appropriate framework in systems in which agents (or individuals) differ in the behavioural rules they employ, according to age, location, or information state. This is especially important in a system like social insects, typically held as a classic example of individual-as-subservient to whole, and therefore most likely to employ universal rules of behaviour. The current paper demonstrates a potentially continuous age-related change in target behaviour (excavation), and suggests an elegant and minimal solution to the requirement for building according to need in ants, avoiding the invocation of potentially complex cognitive mechanisms, or information states that all individuals must have access to in order to have an adaptive excavation output.

      The authors have addressed questions I had in the review process and the manuscript is now clear in its communication and conclusions.

      The modelling approach is compelling, also allowing extrapolation to other group sizes and even other species. This to me is the main strength of the paper, as the answer to the question of whether it is younger or older ants that primarily excavate nests could have been answered by an individual tracking approach (albeit there are practical limitations to this, especially in the observation nest setup, as the authors point out). The analysis of the tunnel structure is also an important piece of the puzzle, and I really like the overall study.

    4. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public review):

      This study investigates how ant group demographics influence nest structures and group behaviors of Camponotus fellah ants, a ground-dwelling carpenter ant species (found locally in Israel) that build subterranean nest structures. Using a quasi-2D cell filled with artificial sand, the authors perform two complementary sets of experiments to try to link group behavior and nest structure: first, the authors place a mated queen and several pupae into their cell and observe the structures that emerge both before and after the pupae eclose (i.e., "colony maturation" experiments); second, the authors create small groups (of 5, 10, or 15 ants, each including a queen) within a narrow age range (i.e., "fixed demographic" experiments) to explore the dependence of construction on age. Some of the fixed demographic instantiations included a manually induced catastrophic collapse event; the authors then compared emergency repair behavior to natural nest creation. Finally, the authors introduce a modified logistic growth model to describe the time-dependent nest area. The modification introduced parameters that allow for age-dependent behavior, and the authors use their fixed demographic experiments to set these parameters, and then apply the model to interpret the behavior of the colony maturation experiments. The main results of this paper are that for natural nest construction, nest areas and morphologies depend on the age demographics of ants in the experiments: younger ants create larger nests and angled tunnels, while older ants tend to dig less and build predominantly vertical tunnels; in contrast, emergency response seems to elicit digging in ants of all ages to repair the nest.

      The experimental results are solid, providing new information and important insights into nest and colony growth in a social insect species. As presented, I still have some reservations about the model's contribution to a deeper understanding of the system. Additional context and explanation of the model, implications, and limitations would be helpful for readers.

      We sincerely thank Reviewer #1 for the time and effort dedicated to our manuscript's detailed review and assessment. The new revision suggestions were constructive, and we have provided a point-by-point response to address them.

      Reviewer #2 (Public review):

      I enjoyed this paper and its examination of the relationship between overall density and age polyethism to reduce the computational complexity required to match nest size with population. I had some questions about the requirement that growth is infinite in such a solution, but these have been addressed by the authors in the responses and the updated manuscript. I also enjoyed the discussion of whether collective behaviour is an appropriate framework in systems in which agents (or individuals) differ in the behavioural rules they employ, according to age, location, or information state. This is especially important in a system like social insects, typically held as a classic example of individual-as-subservient to whole, and therefore most likely to employ universal rules of behaviour. The current paper demonstrates a potentially continuous age-related change in target behaviour (excavation), and suggests an elegant and minimal solution to the requirement for building according to need in ants, avoiding the invocation of potentially complex cognitive mechanisms, or information states that all individuals must have access to in order to have an adaptive excavation output.

      The authors have addressed questions I had in the review process and the manuscript is now clear in its communication and conclusions.

      The modelling approach is compelling, also allowing extrapolation to other group sizes and even other species. This to me is the main strength of the paper, as the answer to the question of whether it is younger or older ants that primarily excavate nests could have been answered by an individual tracking approach (albeit there are practical limitations to this, especially in the observation nest setup, as the authors point out). The analysis of the tunnel structure is also an important piece of the puzzle, and I really like the overall study.

      We sincerely thank Reviewer #2 for the time and effort dedicated to our manuscript's detailed review and assessment.  

      Reviewer #1 (Recommendations for the authors):

      Thank you for the modifications. I found much of the additional information very helpful. I do still have a few comments, which I will include below.

      We thank the reviewer for this comment.

      The authors provide some additional citations for the model, however, the ODE in refs 24 and 30 is different from what the authors present here, and different from what is presented in ref 29. Specifically, the additional "volume" term that multiplies the entire equation. Can the authors provide some additional context for their model in comparison to these models as well as how their model relates to other work?

      We thank the reviewer for this question. The primary difference between the logistic model (refs 24, 30) and the saturation model (ref 29) is rooted in their assumptions about how the number of ants actively participating in nest excavation scales with the nest volume.

      The logistic growth model (dV/dt = αV(1 − V/Vs)) describes the excavation in fixed-sized colonies (50, 100, 200) through a balance of two key processes: (1) positive feedback (αV), where the digging efficiency increases with the nest size, and (2) negative feedback (1 − V/Vs), where growth slows as the nest approaches a saturation volume (Vs). The model assumes that the number of actively excavating ants is linearly proportional to the nest volume (V). This represents a scenario where a large nest contains or can support more workers, which in turn increases the digging rate. This does not require explicit communication between individuals: ants indirectly sense the global nest volume through stigmergic cues, such as pheromone deposition and encounter rates. The model ignores individual differences in age.

      In contrast, the saturation model (dV/dt = α(1 − V/Vs)) assumes that a constant number of ants is working throughout the excavation. The digging rate is therefore independent of the nest volume, slowing only due to the saturation term (1 − V/Vs) as the nest approaches its target size. However, this model imposes a different cognitive requirement: ants must somehow assess the global nest volume (V) and the overall number of ants in the nest. Thus, rather than relying on local cues, ants need more explicit communication or a sophisticated global perception mechanism that allows them to sense the nest volume and the nest population and adjust the digging rate accordingly. Therefore, this model requires a more complex and less biologically plausible mechanism than the logistic model.
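
      The contrast between the two published models can be illustrated numerically. The following is a minimal sketch, not taken from the manuscript: it assumes the saturation model is the logistic model without the volume prefactor (a constant number of digging ants), integrates both with a simple forward Euler step, and uses hypothetical parameter values (ALPHA_LOGISTIC, ALPHA_SATURATION, V_S) chosen only for illustration.

```python
# Hypothetical parameters, chosen only for illustration.
ALPHA_LOGISTIC = 0.5     # per unit time (rate multiplies V)
ALPHA_SATURATION = 5.0   # volume per unit time (constant digging workforce)
V_S = 100.0              # saturation volume


def logistic_rate(v):
    """Logistic model: digging rate scales with the current nest volume."""
    return ALPHA_LOGISTIC * v * (1.0 - v / V_S)


def saturation_rate(v):
    """Saturation model (assumed form): no volume prefactor."""
    return ALPHA_SATURATION * (1.0 - v / V_S)


def simulate(rate, v0=1.0, t_end=50.0, dt=0.01):
    """Forward-Euler integration of dV/dt = rate(V); returns the trajectory."""
    v = v0
    trajectory = [v]
    for _ in range(int(round(t_end / dt))):
        v += dt * rate(v)
        trajectory.append(v)
    return trajectory


logistic_traj = simulate(logistic_rate)
saturation_traj = simulate(saturation_rate)

# Logistic growth accelerates at first (positive feedback), whereas the
# saturation model digs fastest at the start and only slows down.
print("logistic rate at V=1 vs V=50:", logistic_rate(1.0), logistic_rate(50.0))
print("saturation rate at V=1 vs V=50:", saturation_rate(1.0), saturation_rate(50.0))
print("final volumes:", logistic_traj[-1], saturation_traj[-1])
```

      Under these assumptions both curves approach the same saturation volume, but only the logistic curve shows an initial acceleration; this is the qualitative signature that separates the two scaling assumptions.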

      In our age-dependent digging model in the manuscript, we explicitly sum the contribution of each ant towards the nest area expansion based on its age-dependent digging threshold (quantified from fixed demographics experiments). Thus, the term ‘V’ in ‘V(1 − V/Vs)’ has the same effect as the sum over all ants in equation (2) of our manuscript; both describe how the total excavation rate scales with the number of individuals. Under the simplifying assumption that the number of ants is proportional to the nest volume ‘V’, and that all ants dig at a constant rate, our equation (2) in the manuscript reduces to the logistic equation ‘V(1 − V/Vs)’. This implies that each ant individually assesses the nest volume and then digs at a rate ‘(1 − V/Vs)’.

      Thus, we adopted the simpler model from the previously published ones, in which ants individually react to the local density cues and regulate their digging. This approach does not require a global assessment of the nest volume or the number of ants; a local perception of density triggers each ant’s decision to dig, likely modulated by the frequency of social contacts or chemical concentration, which serves as an indicator of the global nest area. The ant compares this locally perceived density to an innate, age-specific threshold. If the perceived local density exceeds its threshold (indicating insufficient area), it digs; otherwise, there is no digging. Thus, excavation dynamics in maturing colonies emerge from this collective response to local density cues, without any individual need to directly assess the global nest volume (V) or having explicit knowledge of the colony size (N).
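
      The local threshold rule described above can be sketched in a few lines. This is an illustrative toy, not the manuscript's equation (2): the linear age-to-threshold mapping (density_threshold), the per-ant digging rate, and all numerical values are hypothetical assumptions, and the perceived local density is approximated by the colony-wide density (ants per unit area).

```python
from dataclasses import dataclass


@dataclass
class Ant:
    age_days: float


def density_threshold(age_days, base=0.05, slope=0.002):
    """Hypothetical rule: older ants tolerate a higher density (ants per cm^2)."""
    return base + slope * age_days


def simulate_excavation(ants, area0=10.0, dig_rate=0.5, dt=0.1, max_steps=5000):
    """Grow the nest area until no ant perceives a density above its threshold."""
    area = area0
    for _ in range(max_steps):
        density = len(ants) / area  # stand-in for the locally perceived density
        diggers = sum(1 for ant in ants
                      if density > density_threshold(ant.age_days))
        if diggers == 0:
            break  # every ant is satisfied, so excavation stops
        area += diggers * dig_rate * dt
    return area


young_group = [Ant(age_days=5.0) for _ in range(10)]
old_group = [Ant(age_days=60.0) for _ in range(10)]

young_area = simulate_excavation(young_group)
old_area = simulate_excavation(old_group)
print("young group final area:", young_area)
print("old group final area:", old_area)
```

      With these assumed thresholds, a group of young ants stops digging at a larger area than an equally sized group of old ants, and no ant ever needs to know the total nest area or the colony size directly.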

      As suggested by the reviewer, we have added these points to the discussion, contrasting the previously published models with our age-dependent excavation models (line numbers: 283-290) “In our study, we adopted the simpler version of previously published age-independent excavation models, where individuals respond to local stigmergic cues such as encounter rates or pheromone concentrations, which serve as a proxy for the global nest volume (24,30). We minimally modified this model to include age-dependent density targets. According to our age-dependent digging model, each ant compares this perceived local density to its own innate age-specific digging threshold as quantified from the fixed demographics experiments. If the perceived local density exceeds its age-dependent area threshold (indicating insufficient area), it digs; otherwise, there is no digging. This mechanism eliminates the need for cognitively demanding global assessment of the total nest volume or the overall colony population, a requirement for the saturation model (29)”. 

      I still find it a little concerning that the age-independent model, though it cannot be correct, fits the data better than the age-dependent modification. It seems to me the models presented in refs 24, 29, and 30, which served as inspiration for the one presented here, do not have any deep theoretical origin, but were chosen for "being consistent with" the observed overall excavated volumes. Is this correct, and if so, how much can/should be gleaned about behavior from these models? Please provide some discussion of what is reasonable to expect from such a model as well as what the limitations might be.

      We thank the reviewer for the comment. 

      In our study, we make an important assumption, as described in lines 161-164 of the manuscript, that ants rely on local cues during nest excavation, and individuals cannot distinguish between the fixed demographics and colony maturation conditions. This implies that the age-dependent target area identified in the fixed demographics experiments should also account for the excavation dynamics seen in the colony maturation experiments.

      From the fixed demographics young and old experiments, we directly quantified that the younger ants excavate a significantly larger area than the older ants for the same group size. This age-dependent digging propensity is an experimental result, and not a model output. 

      We agree that the age-independent model fits the colony maturation experiments well, even though it is not a statistically better fit than the age-dependent model. However, the age-independent models in refs 24, 29, and 30 fail to explain the empirically obtained excavation dynamics in the fixed demographics young and old colonies. If these models were indeed true, we would have observed similar excavated areas in the colony maturation, fixed demographics young, and fixed demographics old colonies of the same size. Thus, the inconsistency of these models confirms that age-independent assumptions are biologically inadequate. These details are explicitly mentioned in lines 304-309.

      We believe that our model’s value is in providing a plausible explanation for the observed excavation dynamics in the colony maturation experiments, and generating testable predictions (Figures 4C and 4D, described in lines 199-216) about the percentage contribution of different age cohorts and queens to the excavated area from the colony maturation experiments. This prediction would not be possible with an age-independent model.

      Minor comments:

      Figure 2A: Please use a color other than white for the model... this curve is still very hard to see

      We thank the reviewer for the comment. The colour is changed to yellow. 

      Figure 4A: Should quoted confidence intervals for slope and intercept be swapped?

      Yes, we thank the reviewer for pointing this out. The labels for the slope and intercept were swapped. We corrected this in the current revised version 2. 

      Figure 5 D-F: Can the authors show data points and confidence intervals instead of bar graphs? The error bars dipping below zero do not clearly represent the data.

      We thank the reviewer for the comment. We now show the individual data points from each treatment with the 95% Confidence Interval of the mean.

    1. eLife Assessment

      The present manuscript by Cordeiro et al., shows convincing evidence that α-mangostin, a xanthone obtained from the fruit of the Garcinia mangostana tree, behaves as a strong activator of the large-conductance (BK) potassium channels; macroscopic currents and single-channel experiments show that α-mangostin produces an increase in the probability of opening, without affecting the single-channel conductance. The authors put forward that α-mangostin activation of the BK channel is state-independent, and molecular docking and mutagenesis suggest that α-mangostin binds to a site in the internal cavity. Additionally, the authors show that α-mangostin can relax arteries, further suggesting the plausibility of the proposed effects of this compound. These are valuable findings that should be of interest to channel biophysicists and physiologists alike.

    2. Reviewer #1 (Public review):

      In this manuscript, the authors aimed to identify the molecular target and mechanism by which α-Mangostin, a xanthone from Garcinia mangostana, produces vasorelaxation that could explain the antihypertensive effects. Building on prior reports of vascular relaxation and ion channel modulation, the authors convincingly show that large-conductance potassium BK channels are the primary site of action. Using electrophysiological, pharmacological, and computational evidence, the authors achieved their aims and showed that BK channels are the critical molecular determinant of mangostin's vasodilatory effects, even though the vascular studies are quite preliminary in nature.

      Strengths:

      (1) The broad pharmacological profiling of mangostin across potassium channel families, revealing BK channels - and the vascular BK-alpha/beta1 complex - as the potently activated target in a concentration-dependent manner.

      (2) Detailed gating analyses showing large negative shifts in voltage-dependence of activation and altered activation and deactivation kinetics.

      (3) High-quality single-channel recordings for open probability and dwell times.

      (4) Convincing activation in reconstituted BKα/β1-Caᵥ nanodomains mimicking physiological conditions and functional proof-of-concept validation in mouse aortic rings.

      Weaknesses are minor:

      (1) Some mutagenesis data (e.g., partial loss at L312A) could benefit from complementary structural validation.

      (2) While Cav-BK nanodomains were reconstituted, direct measurement of calcium signals after mangostin application onto native smooth muscle could be valuable.

      (3) The work has an impact on ion channel physiology and pharmacology, providing a mechanistic link between a natural product and vasodilation. Datasets include electrophysiology traces, mutagenesis scans, docking analyses, and aortic tension recordings. The latter, however, are preliminary in nature.

    3. Reviewer #2 (Public review):

      Summary:

      In the present manuscript, Cordeiro et al. show that α-mangostin, a xanthone obtained from the fruit of the Garcinia mangostana tree, behaves as an agonist of the BK channels. The authors arrive at this conclusion through the effect of mangostin on macroscopic and single-channel currents elicited by BK channels formed by the α subunit and α + β1 subunits, as well as αβ1 channels coexpressed with voltage-dependent Ca2+ (CaV1.2) channels. The single-channel experiments show that α-mangostin produces a robust increase in the probability of opening without affecting the single-channel conductance. The authors contend that α-mangostin activation of the BK channel is state-independent, and molecular docking and mutagenesis suggest that α-mangostin binds to a site in the internal cavity. Importantly, α-mangostin (10 μM) alleviates the contracture promoted by noradrenaline. Mangostin is ineffective if the contracted muscles are pretreated with the BK toxin iberiotoxin.

      Strengths:

      The set of results combining electrophysiological measurements, mutagenesis, and molecular docking reveals α-mangostin as a potent activator of BK channels and the putative location of the α-mangostin binding site. Moreover, experiments conducted on aortic preparations from mice suggest that α-mangostin can aid in developing drugs to treat a myriad of diverse diseases involving the BK channel.

      Weaknesses:

      Major:

      (1) Although the results indicate that α-mangostin is modifying the closed-open equilibrium, the conclusion that this can be due to a stabilization of the voltage sensor in its active configuration may prove to be wrong. It is more probable that, as has been demonstrated for other activators, α-mangostin is increasing the equilibrium constant that defines the closed-open reaction (L in the Horrigan-Aldrich allosteric gating model for BK). The paper would gain much if the authors determined the probability of opening over a wide range of voltages, to establish how the drug is affecting (or not) the channel voltage dependence, the coupling between the voltage sensor and the pore, and the closed-open equilibrium (L).

      (2) Apparently, the molecular docking was performed using the truncated structure of the human BK channel. However, it is unclear which one, since the PDB ID given in the Methods (6vg3), according to what I could find, corresponds to the unliganded, inactive PTK7 kinase domain. Be that as it may, the apo and Ca2+-bound structures show that there is a rotation and a displacement of the S6 transmembrane domain. Therefore, the positions of the residues I308, L312, and A316 in the closed and open configurations of the BK channel are not the same. Hence, the strength of binding is expected to differ depending on whether the channel is closed or open. This point needs to be discussed.

      Minor:

      (1) From Figure 3A, it is apparent that the increase in Po is at the expense of the long periods (seconds) that the channel remains closed. One might suggest that α-mangostin increases the burst periods. It would be beneficial if the authors measured both closed and open dwell times to test whether α-mangostin primarily affects the burst periods.

      (2) In several places, the authors draw parallels between the mode of action of other BK activators and that of α-mangostin; however, the work of Gessner et al. PNAS 2012 indicates that NS1619 and Cym04 interact with the S6/RCK linker, and Webb et al. demonstrated that GoSlo-SR-5-6 agonist activity is abolished when residues in the S4/S5 linker and in the S6C region are mutated. These findings indicate that those agonists do not bind near the selectivity filter, which is where the authors' results suggest that α-mangostin binds.

      (3) The sentence starting in line 452 states that there is a pronounced allosteric coupling between the voltage sensors and Ca2+ binding. If the authors are referring to the coupling factor E in the Horrigan-Aldrich gating model, the references cited, in particular, Sun and Horrigan, concluded that the coupling between those sensors is weak.

    4. Reviewer #3 (Public review):

      Summary:

      This research shows that α-mangostin, a proposed nutraceutical with cardiovascular protective properties, could act through the activation of large-conductance potassium-permeable channels (BK). The authors provide convincing electrophysiological evidence that the compound binds to BK channels and induces a potent activation, increasing the magnitude of potassium currents. Since these channels are important modulators of the membrane potential of smooth muscle in vascular tissue, this activation leads to muscle relaxation, possibly explaining cardiovascular protective effects.

      Strengths:

      The authors present evidence based on several lines of experiments that α-mangostin is a potent activator of BK channels. The quality of the experiments and the analysis is high and represents an appropriate level of analysis. This research is timely and provides a basis to understand the physiological effects of natural compounds with proposed cardio-protective effects.

      Weaknesses:

      The identification of the binding site is not the strongest point of the manuscript. The authors show that the binding site is probably located in the hydrophobic cavity of the pore and show that point mutations reduce the magnitude of the negative voltage shift of activation produced by α-mangostin. However, these experiments do not demonstrate binding to these sites, and could be explained by allosteric effects on gating induced by the mutations themselves.

    5. Author response:

      We sincerely thank the reviewers and editors for their thoughtful evaluations of our work. We are grateful for the careful reading, constructive critiques, and encouraging comments regarding the electrophysiological analyses, mutagenesis, and vascular experiments. The suggestions provided have been very helpful, and we are working to address these points in our revision to strengthen the manuscript and improve its clarity.

      In revising the manuscript, we plan to clarify several text passages as recommended by the reviewers, and review and refine the discussion for improved precision. Following the suggestions of the reviewers, we plan to perform a number of additional experiments to provide more data for the binding region and for further mechanistic and physiological insight. We will prepare a point-by-point response addressing all issues raised in a detailed rebuttal. Additionally, we will include improvements in the Methods section as suggested by the SciScore core report.

      We appreciate the opportunity to revise our work and thank the reviewers again for their valuable feedback.

    1. eLife Assessment

      The one-carbon tetrahydrofolate metabolism plays a crucial role in producing essential metabolic intermediates. In this study, the authors employ a genetics-based approach to demonstrate that three different metabolic pathways are essential for synthesizing 1C-tetrahydrofolates (1C-THF). Disrupting any of these pathways impairs both growth and virulence. Although the work presented is valuable, the experimental evidence remains incomplete without direct quantification of folate intermediates.

    2. Reviewer #1 (Public review):

      Summary:

      This study identifies three redundant pathways (the glycine cleavage system (GCS), serine hydroxymethyltransferase (GlyA), and formate-tetrahydrofolate ligase/FolD) that feed the one-carbon tetrahydrofolate (1C-THF) pool essential for Listeria monocytogenes growth and virulence. Reactivation of the normally inactive fhs gene rescues 1C-THF deficiency, revealing metabolic plasticity and vulnerability for potential antimicrobial targeting.

      Strengths:

      (1) Novel evolutionary insight - reversible reactivation of a pseudogene (fhs) shows adaptive metabolic plasticity, relevant for pathogen evolution.

      (2) They systematically combine targeted gene deletions with suppressor screening to dissect the folate/one-carbon network (GCS, GlyA, Fhs/FolD).

      Weaknesses:

      (1) The study infers 1C-THF depletion mostly genetically and indirectly (growth rescue with adenine) without direct quantification of folate intermediates or fluxes. Biochemical confirmation, LC-MS-based metabolomics of folates/1C donors, or isotopic tracing would strengthen mechanistic claims.

      (2) In multiple result sections, the authors report data from technical triplicates but do not mention independent biological replicates (e.g., Figure 2C, Figure 4A-B, Figure 6D). In addition, some results mention statistical significance but without a detailed description of the specific statistical tests used or replicates, such as Figure 2A-C, Figure 2E, and Figure 2G-I.

    3. Reviewer #2 (Public review):

      Summary:

      The manuscript by Freier et al. examines the impact of deletion of the glycine cleavage system (GCS) GcvPAB enzyme complex in the facultative intracellular bacterial pathogen Listeria monocytogenes. GcvPAB mediates the oxidative decarboxylation of glycine as a first step in a pathway that leads to the generation of N5, N10-methylene-Tetrahydrofolate (THF) to replenish the 1-carbon THF (1C-THF) pool. 1C-THF species are important for the biosynthesis of purines and pyrimidines as well as for the formation of serine, methionine, and N-formylmethionine, and the authors have previously demonstrated that gcvPAB is important for bacterial replication within macrophages. A significant defect for growth is observed for the gcvPAB deletion mutant in defined media, and this growth defect appears to stem from the sensitivity of the mutant strain to excess glycine, which is hypothesized to further deplete the 1C-THF pool. Selection of suppressor mutations that restored growth of gcvPAB deletion mutants in synthetic media with high glycine yielded mutants that reversed stop codon inactivation of the formate-tetrahydrofolate ligase (fhs) gene, supporting the premise that generation of N10-formyl-THF can restore growth. Mutations within the folK, codY, and glyA genes, the last of which encodes serine hydroxymethyltransferase, were also identified, although the functional impact of these mutations is somewhat less clear. Overall, the authors report that their work identifies three pathways that feed the 1C-THF pool to support the growth and virulence of L. monocytogenes and that this work represents the first example of the spontaneous reactivation of a L. monocytogenes gene that is inactivated by a premature stop codon.

      Strengths:

      This is an interesting study that takes advantage of a naturally existing fhs mutant Listeria strain to reveal the contributions of different pathways leading to 1C-THF synthesis. The defects observed for the gcvPAB mutant in terms of intracellular growth and virulence are somewhat subtle, indicating that bacteria must be able to access host sources (such as adenine?) to compensate for the loss of purine and fMet synthesis. Overall, the authors do a nice job of assessing the importance of the pathways identified for 1C-THF synthesis.

      Weaknesses:

      (1) Line 114 and Figure 1: The authors indicate that the gcvPAB deletion forms significantly fewer plaques in addition to forming smaller plaques (although this is a bit hard to see in the plaque images). A reduction in the overall number of plaques sounds like a bacterial invasion defect - has this been carefully assessed? The smaller plaque size makes sense with reduced bacterial replication, but I'm not sure I understand the reduction in plaque number.

      (2) Do other Listeria strains contain the stop codon in fhs? How common is this mutation? That would be interesting to know.

      (3) Based on the observation that fhs+ ΔgcvPAB ΔglyA mutant is only possible to isolate in complex media, and fhs is responsible for converting formate to 1C-THF with the addition of FolD, have the authors thought of supplementing synthetic media with formate and assessing mutant growth?

    4. Reviewer #3 (Public review):

      Summary:

      In this study, Freier et al. demonstrate that 3 distinct metabolic pathways are critical for the synthesis of 1C-THF, a metabolite that is crucial for the growth and virulence of Listeria monocytogenes. Using an elegant suppressor screen, they also demonstrate the hierarchical importance of these metabolic pathways with respect to the biosynthesis of 1C-THF.

      Strengths:

      This study uses elegant bacterial genetics to confirm that 3 distinct metabolic pathways are critical for 1C-THF synthesis in L. monocytogenes, and the lack of either one of these pathways compromises bacterial growth and virulence. The study uses a combination of in vitro growth assays, macrophage-CFU assays, and murine infection models to demonstrate this.

      Weaknesses:

      (1) The primary finding of the study is that the perturbation of any of the 3 metabolic pathways important for the synthesis of 1C-THF results in reduced growth and virulence of L. monocytogenes. However, there is no evidence demonstrating the levels of 1C-THF in the various knockouts and suppressor mutants used in this study. It is important to measure the levels of this metabolite (ideally using mass spectrometry) in the various knockouts and suppressor mutants, to provide strong causality.

      (2) The story becomes a little hard to follow since macrophage-CFU assays and murine infection model data precede the in vitro growth assays. The manuscript would benefit from a reorganization of Figures 2,3, and 4 for better readability and flow of data.

    1. eLife Assessment

      The study highlights development of a multiplex coregulator TR-FRET (CRT) assay that detects ligands with theoretical full agonist, partial agonist, antagonist, and inverse agonist signatures within the same chemical series. The findings are valuable and will have theoretical and practical implications in the subfield, with respect to guiding the design of non-lipogenic liver X receptor (LXR) agonists. The strength of the evidence is solid, whereby the methods, data, and analyses broadly support the claims with only minor weaknesses that can be dealt with through improvements in the data analysis and the discussion. This study will be of interest to experts working in the areas of pharmacology, medicinal chemistry, and drug discovery in Alzheimer's diseases and dementias.

    2. Reviewer #1 (Public review):

      Summary:

      This important study functionally profiled ligands targeting the LXR nuclear receptors using biochemical assays in order to classify ligands according to pharmacological functions. Overall, the evidence is solid, but nuances in the reconstituted biochemical assays and cellular studies and terminology of ligand pharmacology limit the potential impact of the study. This work will be of interest to scientists interested in nuclear receptor pharmacology.

      Strengths:

      (1) The authors rigorously tested their ligand set in CRTs for several nuclear receptors that could display ligand-dependent cross-talk with LXR cellular signaling and found that all compounds display LXR selectivity when used at ~1 µM.

      (2) The authors tested the ligand set for selectivity against two LXR isoforms (alpha and beta). Most compounds were found to be LXRbeta-specific.

      (3) The authors performed extensive LXR CRTs, correlation analysis to cellular transcription and gene expression, and classification profiling using heatmap analysis, seeking to use relatively easy-to-collect biochemical assays with purified ligand-binding domain (LBD) protein to explain the complex activity of full-length LXR-mediated transcription.

      Weaknesses:

      (1) The descriptions of some observations lack detail, which limits understanding of some key concepts.

      (2) The presence of endogenous NR ligands within cells may confound the correlation of ligand activity of cellular assays to biochemical assay data.

      (3) The normalization of biochemical assay data could confound the classification of graded activity ligands.

      (4) The presence of >1 coregulator peptide in the biplex (n=2 peptides) CRT (pCRT) format will bias the LBD conformation towards the peptide-bound form with the highest binding affinity, which will impact potency and interpretation of TR-FRET data.

      (5) Correlation graphical plots lack sufficient statistical testing.

      (6) Some of the proposed ligand pharmacology nomenclature is not clear and deviates from classifications used currently in the field (e.g., hard and soft antagonist; weak vs. partial agonist, definition of an inverse agonist that is not the opposite function to an agonist).

    3. Reviewer #2 (Public review):

      Summary:

      In this manuscript by Laham and co-workers, the authors profiled structurally diverse LXR ligands via a coregulator TR-FRET (CRT) assay for their ability to recruit coactivators and kick off corepressors, while identifying coregulator preference and LXR isoform selectivity.

      The relative ligand potencies measured via CRT for the two LXR isoforms were correlated with ABCA1 induction or lipogenic activation of SRE, depending on cellular contexts (i.e., astrocytoma or hepatocarcinoma cells). While these correlations are interesting, there is some leeway to improve the quantitative presentation of these correlations. Finally, the CRT signatures were correlated with the structural stabilization of the LXR:coregulator complexes. In aggregate, this study curated a set of LXR ligands with disparate agonism signatures that may guide the design of future nonlipogenic LXR agonists with potential therapeutic applications for cardiovascular disease, Alzheimer's, and type 2 diabetes, without inducing mechanisms that promote fat/lipid production.

      Strengths:

      This study has many strengths, from curating an excellent LXR compound set to the thoughtful design of the CRT and cellular assays. The design of a multiplexed precision CRT (pCRT) assay that detects corepressor displacement as a function of ligand-induced coactivator recruitment is quite impressive, as it allows measurement of ligand potencies to displace corepressors in the presence of coactivators, which cannot be achieved in a regular CRT assay that looks at coactivator recruitment and corepressor dissociation in separate experiments.

      Weaknesses:

      I did not identify any major weaknesses.

    1. eLife Assessment

      This manuscript describes a valuable screening approach to identifying nanobodies with the potential to modulate gene expression via epigenetic regulators. While the concept is of interest and the screening strategy is well designed, the current evidence supporting mechanistic specificity remains incomplete.

    2. Reviewer #1 (Public review):

      Summary:

      This study presents a high-throughput screening platform to identify nanobodies capable of recruiting chromatin regulators and modulating gene expression. The authors utilize a yeast display system paired with mammalian reporter assays to validate candidate nanobodies, aiming to create a modular resource for synthetic epigenetic control.

      Strengths:

      (1) The overall screening design combining yeast display with mammalian functional assays is innovative and scalable.

      (2) The authors demonstrate proof-of-concept that nanobody-based recruitment can repress or activate reporter expression.

      (3) The manuscript contributes to the growing toolkit for epigenome engineering.

      Weaknesses:

      (1) The manuscript does not investigate which endogenous factors are recruited by the nanobodies. While repression activity is demonstrated at the reporter level, there is no mechanistic insight into what proteins are being brought to the target site by each nanobody. This limits the interpretability and generalizability of the findings. Related to this, Figure S1B reports sequence similarity among complementarity-determining regions (CDRs) of nanobodies that scored highly in the DNMT3A screen. However, it remains unclear whether this similarity reflects convergence on a common molecular target or is coincidental. Without functional or proteomic validation, the relationship between sequence motifs and effector recruitment remains speculative.

      (2) The epigenetic consequences of nanobody recruitment are also left unexplored. Despite targeting epigenetic regulators, the study does not assess changes such as DNA methylation or histone modifications. This makes it difficult to interpret whether the observed reporter repression is due to true chromatin remodeling or secondary effects.

    3. Reviewer #2 (Public review):

      Summary:

      Wan, Thurm et al. use a yeast nanobody library that is thought to have diverse binders to isolate those that specifically bind to proteins of their interest. The yeast nanobody library collection in general carries enormous potential, but the challenge is to isolate binders that have specific activity. The authors posit that one reason for this isolation challenge is that the negative binders, in general, dampen the signal from the positive binders. This is a classic screening problem (one that geneticists have faced over decades) and, in general, underscores the value of developing a good secondary screen. Over many years, the authors have developed an elegant platform to carry out high-throughput silencing-based assays, thus creating the perfect secondary screen platform to isolate nanobodies that bind to chromatin regulators.

      Strengths:

      Highlights the enormous value of a strong secondary screen when identifying binders that can be isolated from the yeast nanobody library. This insight is generalizable, and I expect that this manuscript should help inspire many others to design such approaches.

      Provides new cell-based reagents that can be used to recruit epigenetic activators or repressors to modulate gene expression at target loci.

      Weaknesses:

      The authors isolate DNMT3A and TET1/2 enzymes directly from cell lysates and bind these proteins to beads. It is not clear what proteins are, in fact, bound to beads at the end of the IP. Epigenetic repressors are part of complexes, and it would be helpful to know if the IP is specific and whether the IP pulls down only DNMT3A or other factors. While this does not change the underlying assumptions about the screen, it does alter the authors' conclusions about whether the nanobody exclusively recruits DNMT3A or potentially binds to other co-factors.

      Using IP-MS to validate the pull-down would be a helpful addition to the manuscript, although one could very reasonably make the case that other co-factors get washed away during the course of the selection assay. Nevertheless, if there are co-factors that are structural and remain bound, these are likely to show up in the MS experiment.

    1. eLife Assessment

      This important study reports on the relationships between cerebral haemodynamics and a number of factors that relate to genetics, lifestyle, and medical history using data from a large cohort. Compelling evidence suggests that brief arterial spin labelling MRI acquisition can lead to both expected observations about brain health, as manifested in cerebral blood flow, and biomarkers for use in diagnosis and treatment monitoring. The results can be used as a starting point for hypothesis generation and further evaluation of conditions expected to affect haemodynamics in the brain.

    2. Reviewer #1 (Public review):

      Summary:

      In this work, Okell et al. describe the imaging protocol and analysis pipeline pertaining to the arterial spin labeling (ASL) MRI protocol acquired as part of the UK Biobank imaging study. In addition, they present preliminary analyses of the first 7000+ subjects in whom ASL data were acquired, and this represents the largest such study to date. Careful analyses revealed expected associations between ASL-based measures of cerebral hemodynamics and non-imaging-based markers, including heart and brain health, cognitive function, and lifestyle factors. As it measures physiology and not structure, ASL-based measures may be more sensitive to these factors compared with other imaging-based approaches.

      Strengths:

      This study represents the largest MRI study to date to include ASL data in a wide age range of adult participants. The ability to derive arterial transit time (ATT) information in addition to cerebral blood flow (CBF) is a considerable strength, as many studies focus only on the latter.

      Some of the results (e.g., relationships with cardiac output and hypertension) are known and expected, while others (e.g., lower CBF and longer ATT correlating with hearing difficulty in auditory processing regions) are more novel and intriguing. Overall, the authors present very interesting physiological results, and the analyses are conducted and presented in a methodical manner.

      The analyses regarding ATT distributions and the potential implications for selecting post-labeling delays (PLD) for single PLD ASL are highly relevant and well-presented.

      Weaknesses:

      At a total scan duration of 2 minutes, the ASL sequence utilized in this cohort is much shorter than that of a typical ASL sequence (closer to 5 minutes as mentioned by the authors). However, this implementation also included multiple (n=5) PLDs. As currently described, it is unclear how many repetitions were acquired at each PLD and whether these were acquired efficiently (i.e., with a Look-Locker readout) or whether individual repetitions within this acquisition were dedicated to a single PLD. If the latter, the number of repetitions per PLD (and consequently signal-to-noise ratio, SNR) is likely to be very low. Have the authors performed any analyses to determine whether the signal in individual subjects generally lies above the noise threshold? This is particularly relevant for white matter, which is the focus of several findings discussed in the study.

      Hematocrit is one of the variables regressed out in order to reduce the effect of potential confounding factors on the image-derived phenotypes. The effect of this, however, may be more complex than accounting for other factors (such as age and sex). The authors acknowledge that hematocrit influences ASL signal through its effect on longitudinal blood relaxation rates. However, it is unclear how the authors handled the fact that the longitudinal relaxation of blood (T1Blood) is explicitly needed in the kinetic model for deriving CBF from the ASL data. In addition, while it may reduce false positives related to the relationships between dietary factors and hematocrit, it could also mask the effects of anemia present in the cohort. The concern, therefore, is two-fold: (1) Were individual hematocrit values used to compute T1Blood values? (2) What effect would the deconfounding process have on this?

      The authors leverage an observed inverse association between white matter hyperintensity volume and CBF as evidence that white matter perfusion can be sensitively measured using the imaging protocol utilized in this cohort. The relationship between white matter hyperintensities and perfusion, however, is not yet fully understood, and there is disagreement regarding whether this structural imaging marker necessarily represents impaired perfusion. Therefore, it may not be appropriate to use this finding as support for validation of the methodology.

    3. Reviewer #2 (Public review):

      Summary:

      Okell et al. report the incorporation of arterial spin-labeled (ASL) perfusion MRI into the UK Biobank study and preliminary observations of perfusion MRI correlates from over 7000 acquired datasets, which is the largest sample of human perfusion imaging data to date. Although a large literature already supports the value of ASL MRI as a biomarker of brain function, this important study provides compelling evidence that a brief ASL MRI acquisition may lead to both fundamental observations about brain health as manifested in CBF and valuable biomarkers for use in diagnosis and treatment monitoring.

      ASL MRI noninvasively quantifies regional cerebral blood flow (CBF), which reflects both cerebrovascular integrity and neural activity, hence serves as a measure of brain function and a potential biomarker for a variety of CNS disorders. Despite a highly abbreviated ASL MRI protocol, significant correlations with both expected and novel demographic, physiological, and medical factors are demonstrated. In many such cases, ASL was also more sensitive than other MRI-derived metrics. The ASL MRI protocol implemented also enables quantification of arterial transit time (ATT), which provides stronger clinical correlations than CBF in some factors. The results demonstrate both the feasibility and the efficacy of ASL MRI in the UK Biobank imaging study, which expects to complete ASL MRI in up to 60,000 richly phenotyped individuals.

      Strengths:

      A key strength of this study is the use of an ASL MRI protocol incorporating balanced pseudocontinuous labeling with a background-suppressed 3D readout, which is the current state-of-the-art. To compensate for the short scan time, voxel resolution was intentionally only moderate. The authors also elected to acquire these data across five post-labeling delays, enabling ATT and ATT-corrected CBF to be derived using the BASIL toolbox, which is based on a variational Bayesian framework. The resulting CBF and ATT maps shown in Figure 1 are quite good, especially when combined with such a large and deeply phenotyped sample.

      Another strength of the study is the rigorous image analysis approach, which included covariation for a number of known CBF confounds as well as correction for motion and scanner effects. In doing so, the authors were able to confirm expected effects of age, sex, hematocrit, and time of day on CBF values. These observations lend confidence in the veracity of novel observations, for example, significant correlations between regional ASL parameters and cardiovascular function, height, alcohol consumption, depression, and hearing, as well as with other MRI features such as regional diffusion properties and magnetic susceptibility. They also provide valuable observations about ATT and CBF distributions across a large cohort of middle-aged and older adults.

      Weaknesses:

      This study primarily serves to illustrate the efficacy and potential of ASL MRI as an imaging parameter in the UK Biobank study, but some of the preliminary observations will be hypothesis-generating for future analyses in larger sample sizes. However, a weakness of the manuscript is that some of the reported observations are difficult to follow. In particular, the associations between ASL and resting fMRI illustrated in Figure 7 and described in the accompanying Results text are difficult to understand. It could also be clearer whether the spatial maps showing ASL correlates of other image-derived phenotypes in Figure 6B are global correlations or confined to specific regions of interest. Finally, while addressing partial volume effects in gray matter regions by covarying for cortical thickness is a reasonable approach, the Methods section seems to imply that a global mean cortical thickness is used, which could be problematic given that cortical thickness changes may be localized.

    4. Reviewer #3 (Public review):

      Summary:

      This is an extremely important manuscript in the evolution of cerebral perfusion imaging using Arterial Spin Labelling (ASL). The number of subjects that were scanned has provided the authors with a unique opportunity to explore many potential associations between regional cerebral blood flow (CBF) and clinical and demographic variables.

      Strengths:

      The major strength of the manuscript is the access to an unprecedentedly large cohort of subjects. It demonstrates the sensitivity of regional tissue blood flow in the brain as an important marker of resting brain function. In addition, the authors have demonstrated a thorough analysis methodology and good statistical rigour.

      Weaknesses:

      This reviewer did not identify any major weaknesses in this work.

    1. eLife Assessment

      This important study presents convincing evidence that uncovers a novel signaling axis impacting the post-mating response in females of the brown planthopper. The findings open several avenues for testing the molecular and neurobiological mechanisms of mating behavior in insects, although broad concerns remain about the relevance of some claims.

    2. Reviewer #1 (Public review):

      In this work, Zhang et al., through a series of well-designed experiments, present a comprehensive study exploring the roles of the neuropeptide Corazonin (CRZ) and its receptor in controlling the female post-mating response (PMR) in the brown planthopper (BPH) Nilaparvata lugens and Drosophila melanogaster. Through a series of behavioural assays, micro-injections, gene knockdowns, CRISPR/Cas gene editing, and immunostaining, the authors show that both CRZ and CrzR play a vital role in the female post-mating response, with impaired expression of either leading to quicker female remating and reduced ovulation in BPH. Notably, the authors find that this signaling is entirely endogenous in BPH females, with immunostaining of male accessory glands (MAGs) showing no evidence of CRZ expression. Further, the authors demonstrate that while CRZ is not expressed in the MAGs, BPH males with Crz knocked out show transcriptional dysregulation of several seminal fluid proteins and functionally link this dysregulation to an impaired PMR in BPH. In relation, the authors also find that in CrzR mutants, the injection of neither MAG extracts nor maccessin peptide triggered the PMR in BPH females. Finally, the authors extend this study to D. melanogaster, albeit on a more limited scale, and show that CRZ plays a vital role in maintaining PMR in D. melanogaster females with impaired CRZ signaling, once again leading to quicker female remating and reduced ovulation. The authors must be commended for their expansive set of complementary experiments. The manuscript is also generally well written. Given the seemingly conserved nature of CRZ, this work is a significant addition to the literature, opening several avenues for testing the molecular and neurobiological mechanisms in which CRZ triggers the PMR.

      However, there are some broad concerns/comments I had with this manuscript. The authors provide clear evidence that CRZ signaling plays a major role in the PMR of D. melanogaster; however, they provide no evidence that CRZ signaling is endogenous, as they did not check for expression in the MAGs of D. melanogaster males. Additionally, while the authors show that manipulating Crz in males leads to dysregulated seminal fluid expression and impaired PMR in BPH, the authors also find that CRZ injection in males in and of itself impairs PMR in BPH. The authors do not really address what this seemingly contradictory result could mean. While a lot of the figures have replicate numbers, the authors do not factor in replicate as an effect in their models, which they ideally should do.

      Finally, while the discussion is generally well-written, it lacks a broader conclusion about the wider implications of this study and what future work building on this could look like.

    3. Reviewer #2 (Public review):

      Summary:

      The work presented by Zhang and coauthors in this manuscript presents the study of the neuropeptide corazonin in modulating the post-mating response of the brown planthopper, with further validation in Drosophila melanogaster. To obtain their results, the authors used several different techniques that orthogonally demonstrate the involvement of corazonin signalling in regulating the female post-mating response in these species.

      They first injected synthetic corazonin peptide into female brown planthoppers, showing altered mating receptivity in virgin females and a higher number of eggs laid after mating. The role of corazonin in controlling these post-mating traits has been further validated by knocking down the expression of the corazonin gene by RNA interference and through CRISPR-Cas9 mutagenesis of the gene. Further proof of the importance of corazonin signalling in regulating the female post-mating response has been achieved by knocking down the expression or mutagenizing the gene coding for the corazonin receptor.

      Similar results have been obtained in the fruit fly Drosophila melanogaster, suggesting that corazonin signalling is involved in controlling the female post-mating response in multiple insect species. Notably, the authors also show that corazonin controls gene expression in the male accessory glands and that disruption of this pathway in males compromises their ability to elicit normal post-mating responses in their mates.

      Strengths:

      The study of the signalling pathways controlling the female post-mating response in insects other than Drosophila is scarce, and this limits the ability of biologists to draw conclusions about the evolution of the post-mating response in female insects. This is particularly relevant in the context of understanding how sexual conflict might work at the molecular and genetic levels, and how, ultimately, speciation might occur at this level. Furthermore, the study of the post-mating response could have practical implications, as it can lead to the development of control techniques, such as sterilization agents.

      The study, therefore, expands the knowledge of one of the signalling pathways that control the female post-mating response, the corazonin neuropeptide. This pathway is involved in controlling the post-mating response in both Nilaparvata lugens (the brown planthopper) and Drosophila melanogaster, suggesting its involvement in multiple insect species.

      The study uses multiple molecular approaches to convincingly demonstrate that corazonin controls the female post-mating response.

      Weaknesses:

      The data supporting the main claims of the manuscript are solid and convincing. The statistical analysis of some of the data might be improved, particularly by tailoring the analysis to the type of data that has been collected.

      In the case of the corazonin effect in females, all the data are coherent; in the case of CRISPR-Cas9-induced mutagenesis, the analysis of the behavioural trait in heterozygotes might have helped in understanding the haplosufficiency of the gene and would have further proved the authors' point.

      Less consistency was achieved in males (Figure 5): the authors show that injection of CRZ and RNAi of crz, or mutant crz, has the same effect on male fitness. However, the CRZ injection should activate the pathway, and crz RNAi and mutant crz should inhibit the pathway, yet they have the same effect. A comment about this discrepancy would have improved the clarity of the manuscript, pointing to new points that need to be clarified and opening new scientific discussion.

    1. eLife Assessment

      This valuable study addresses a critical and timely question regarding the role of a subpopulation of cortical interneurons (Chrna2-expressing Martinotti cells) in motor learning and cortical dynamics. However, while some of the behavior and imaging data are impressive, the small sample sizes and incomplete behavioral and activity analyses make interpretation difficult; therefore, they are insufficient to support the central conclusions. The study may be of interest to neuroscientists studying cortical neural circuits, motor learning, and motor control.

    2. Reviewer #1 (Public review):

      In this study, the authors investigated a specific subtype of SST-INs (layer 5 Chrna2-expressing Martinotti cells) and examined its functional role in motor learning. Using endoscopic calcium imaging combined with chemogenetics, they showed that activation of Chrna2 cells reduces the plasticity of pyramidal neuron (PyrN) assemblies but does not affect the animals' performance. However, activating Chrna2 cells during re-training improved performance. The authors claim that activating Chrna2 cells likely reduces PyrN assembly plasticity during learning and possibly facilitates the expression of already acquired motor skills.

      There are many major issues with the study. The findings across experiments are inconsistent, and it is unclear how the authors performed their analyses or why specific time points and comparisons were chosen. The study requires major re-analysis and additional experiments to substantiate its conclusions.

      Major Points:

      (1a) Behavior task - the pellet-reaching task is a well-established paradigm in the motor learning field. Why did the authors choose to quantify performance using "success pellets per minute" instead of the more conventional "success rate" (see PMID 19946267, 31901303, 34437845, 24805237)? It is also confusing that the authors describe sessions 1-5 as being performed on a spoon, while from session 6 onward, the pellets are presented on a plate. However, in lines 710-713, the authors define session 1 as "naïve," session 2 as "learning," session 5 as "training," and "retraining" as a condition in which a more challenging pellet presentation was introduced. Does "naïve session 1" refer to the first spoon session or to session 6 (when the food is presented on a plate)? The same ambiguity applies to "learning session 2," "training session 5," and so on. Furthermore, what criteria did the authors use to designate specific sessions as "learning" versus "training"? Are these definitions based on behavioral performance thresholds or some biological mechanisms? Clarifying these distinctions is essential for interpreting the behavioral results.

      (1b) Judging from Figures 1F and 4B, even in WT mice, it is not convincing that the animals have actually learned the task. In all figures, the mice generally achieve ~10-20 pellets per minute across sessions. The only sessions showing slightly higher performance are session 5 in Figure 1F ("train") and sessions 12 and 13 in Figure 4B ("CLZ"). In the classical pellet-reaching task, animals are typically trained for 10-12 sessions (approximately 60 trials per session, one session per day), and a clear performance improvement is observed over time. The authors should therefore present performance data for each individual session to determine whether there is any consistent improvement across days. As currently shown, performance appears largely unchanged across sessions, raising doubts about whether motor learning actually occurred.

      (1c) The authors also appear to neglect existing literature on the role of SST-INs in motor learning and local circuit plasticity (e.g., PMID 26098758, 36099920). Although the current study focuses on a specific subpopulation of SST-INs, the results reported here are entirely opposite to those of previous studies. The authors should, at a minimum, acknowledge these discrepancies and discuss potential reasons for the differing outcomes in the Discussion section.

      (2a) Calcium imaging - The methodology for quantifying fluorescence changes is confusing and insufficiently described. The use of absolute ΔF values ("detrended by baseline subtraction," lines 565-567) for analyses that compare activity across cells and animals (e.g., Figure 1H) is highly unconventional and problematic. Calcium imaging is typically reported as ΔF/F₀ or z-scores to account for large variations in baseline fluorescence (F₀) due to differences in GCaMP expression, cell size, and imaging quality. Absolute ΔF values are uninterpretable without reference to baseline intensity - for example, a ΔF of 5 corresponds to a 100% change in a dim cell (F₀ = 5) but only a 1% change in a bright cell (F₀ = 500). This issue could confound all subsequent population-level analyses (e.g., mean or median activity) and across-group comparisons. Moreover, while some figures indicate that normalization was performed, the Methods section lacks any detailed description of how this normalization was implemented. The critical parameters used to define the baseline are also omitted. The authors should reprocess the imaging data using a standardized ΔF/F₀ or z-score approach, explicitly define the baseline calculation procedure, and revise all related figures and statistical analyses accordingly.

      (2b) Figure 1G - It is unclear why neural activity during successful trials is already lower one second before movement onset. Full traces with longer duration before and after movement onset should also be shown. Additionally, only data from "session 2 (learning)" and a single neuron are presented. The authors should present data across all sessions and multiple neurons to determine whether this observation is consistent and whether it depends on the stage of learning.

      (2c) Figure 1H - The authors report that chemogenetic activation of Chrna2 cells induces differential changes in PyrN activity between successful and failed trials. However, one would expect that activating all Chrna2 cells would strongly suppress PyrN activity rather than amplifying the activity differences between trials. The authors should clarify the mechanism by which Chrna2 cell activation could exaggerate the divergence in PyrN responses between successful and failed trials. Perhaps, performing calcium imaging of Chrna2 cells themselves during successful versus failed trials would provide insight into their endogenous activity patterns and help interpret how their activation influences PyrN activity during successful and failed trials.

      (2d) Figure 1H - Also, in general, the Cre⁺ (red) data points appear consistently higher in activity than the Cre⁻ (black) points. This is counterintuitive, as activating Chrna2 cells should enhance inhibition and thereby reduce PyrN activity. The authors should clarify how Cre⁺ animals exhibit higher overall PyrN activity under a manipulation expected to suppress it. This discrepancy raises concerns about the interpretation of the chemogenetic activation effects and the underlying circuit logic.

      (3) The statistical comparisons throughout the manuscript are confusing. In many cases, the authors appear to perform multiple comparisons only among the N, L, T, and R conditions within the WT group. However, the central goal of this study should be to assess differences between the WT and hM3D groups. In fact, it is unclear why the authors only provide p-values for some comparisons but not for the majority of the groups.

      (4a) Figure 4 - It is hard to understand why the authors introduce LFP experiments here, and the results are difficult to interpret in isolation. The authors should consider combining LFP recordings with calcium imaging (as in Figure 1) or, alternatively, repeating calcium imaging throughout the entire re-training period. This would provide a clearer link between circuit activity and behavior and strengthen the conclusions regarding Chrna2 cell function during re-training.

      (4b) It is unclear why CLZ has no apparent effect in session 11, yet induces a large performance increase in sessions 12 and 13. Even then, the performance in sessions 12 and 13 (~30 successful pellets) is roughly comparable to Session 5 in Figure 1F. Given this, it is questionable whether the authors can conclude that Chrna2 cell activation truly facilitates previously acquired motor skills.

      (5) Figure 5 - The authors report decreased performance in the pasta-handling task (presumably representing a newly learned skill) but observe no difference in the pellet-reaching task (presumably an already acquired skill). This appears to contradict the authors' main claim that Chrna2 cell activation facilitates previously acquired motor skills.

      (6) Supplementary Figure 1 - The c-fos staining appears unusually clean. Previous studies have shown that even in home-cage mice, there are substantial numbers of c-fos⁺ cells in M1 under basal conditions (PMID 31901303). Additionally, the authors should present Chrna2 cell labeling and c-fos staining in separate channels. As currently shown, it is difficult to determine whether the c-fos⁺ cells are truly Chrna2⁺ cells.
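
      The normalization concern raised in point (2a) can be illustrated with a minimal sketch. It assumes, purely for illustration, that the baseline F₀ is the mean of the first few frames of each trace; the manuscript's actual baseline definition is not described, and the function name and example traces below are hypothetical. The two toy cells reproduce the arithmetic quoted in (2a): the same absolute ΔF of 5 is a 100% change in a dim cell and a 1% change in a bright cell.

```python
from statistics import mean


def delta_f_over_f0(trace, baseline_frames):
    """Return (F - F0) / F0 for every sample in `trace`.

    F0 is taken as the mean of the first `baseline_frames` samples.
    This baseline choice is an assumption made for illustration only.
    """
    f0 = mean(trace[:baseline_frames])
    if f0 <= 0:
        raise ValueError("baseline fluorescence must be positive")
    return [(f - f0) / f0 for f in trace]


# Hypothetical traces: four baseline frames followed by one response frame.
dim_cell = [5.0, 5.0, 5.0, 5.0, 10.0]            # F0 = 5
bright_cell = [500.0, 500.0, 500.0, 500.0, 505.0]  # F0 = 500

# Absolute change is identical in both cells (delta F = 5).
dim_abs = dim_cell[-1] - mean(dim_cell[:4])
bright_abs = bright_cell[-1] - mean(bright_cell[:4])

# Normalized change differs by two orders of magnitude.
dim_peak = max(delta_f_over_f0(dim_cell, 4))
bright_peak = max(delta_f_over_f0(bright_cell, 4))

print(f"dim cell:    dF = {dim_abs:.1f}, dF/F0 = {dim_peak:.2%}")
print(f"bright cell: dF = {bright_abs:.1f}, dF/F0 = {bright_peak:.2%}")
```

      Pooling the absolute values across cells would treat these two responses as equal, which is why a stated baseline procedure matters for any population-level comparison.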

      Overall, the authors selectively report statistical comparisons only for findings that support their claims, while most other potentially informative comparisons are omitted. Complete and transparent reporting is necessary for proper interpretation of the data.

    3. Reviewer #2 (Public review):

      Summary:

      In this manuscript, Malfatti et al. study the role of Chrna2 Martinotti cells (Mα2 cells), a subset of SST interneurons, for motor learning and motor cortex activity. The authors trained mice on a forelimb prehension task while recording neuronal activity of pyramidal cells using calcium imaging with a head-mounted miniscope. While chemogenetically increasing Mα2 cell activity did not affect motor learning, it changed pyramidal cell activity such that activity peaks became sharper and differently timed than in control mice. Moreover, co-active neuronal assemblies become more stable with a smaller spatial distribution. Increasing Mα2 cell activity in previously trained mice did increase performance on the prehension task and led to increased theta and gamma band activity in the motor cortex. On the other hand, genetic ablation of Mα2 cells affected fine motor movements on a pasta handling task while not affecting the prehension task.

      Strengths:

      The proposed question of how Chrna2-expressing SST interneurons affect motor learning and motor cortex activity is important and timely. The study employs sophisticated approaches to record neuronal activity and manipulate the activity of a specific neuronal population in behaving mice over the course of motor learning. The authors analyze a variety of neuronal activity parameters, comparing different behavior trials, stages of learning, and the effects of Mα2 cell activation. The analysis of neuronal assembly activity and stability over the course of learning by tracking individual neurons throughout the imaging sessions is notable, since technically challenging, and yielded the interesting result that neuronal assemblies are more stable when activating Mα2 cells.

      Overall, the study provides compelling evidence that Mα2 cells regulate certain aspects of motor behaviors, likely by shaping circuit activity in the motor cortex.

      Weaknesses:

      The main limitation of the study lies in its small sample sizes and the absence of key control experiments, which substantially weaken the strength of the conclusions.

      Core findings of this paper, such as the lack of effect of Mα2 cell activation on motor learning, as well as the altered neuronal activity, rely on a sample size of n=3 mice per condition, which is likely underpowered to detect differences in behavior and contributes to the somewhat disconnected results on calcium activity, activity timing, and neuronal assembly activity.
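
      The sample-size concern can be made concrete with a minimal sketch. It assumes an exact two-sided permutation test on the difference in group means, which is not necessarily the test used in the manuscript, and the success rates below are hypothetical values, not data from the study. With 3 mice per group there are only 20 possible group assignments, so even perfectly separated groups cannot give a p-value below 0.1 under this test.

```python
from itertools import combinations


def exact_permutation_p(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    def abs_mean_diff(indices):
        sum_a = sum(pooled[i] for i in indices)
        return abs(sum_a / n_a - (total - sum_a) / n_b)

    observed = abs_mean_diff(range(n_a))
    # Enumerate every way of assigning n_a of the pooled values to group A.
    splits = list(combinations(range(len(pooled)), n_a))
    # Small tolerance so floating-point noise does not hide exact ties.
    extreme = sum(1 for s in splits if abs_mean_diff(s) >= observed - 1e-12)
    return extreme / len(splits)


# Hypothetical success rates for illustration only: complete group separation.
control = [0.40, 0.45, 0.50]
treated = [0.80, 0.85, 0.90]

p_value = exact_permutation_p(control, treated)
print(f"two-sided exact permutation p = {p_value:.2f}")
```

      A parametric test such as a t-test can still reach conventional significance at n=3 when the effect is very large relative to the variance, so the sketch illustrates the limited resolution of such small groups rather than a general impossibility.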

      More comprehensive analyses and data presentation are also needed to substantiate the results. For example, examining calcium activity and behavioral performance on a trial-by-trial basis could clarify whether closely spaced reaching attempts influence baseline signals and skew interpretation.

      The study uses Cre-negative mice as controls for hM3Dq-mediated activation, which does not account for potential effects of Cre-dependent viral expression that occur only in Cre-positive mice.

      This important control would be necessary to substantiate the conclusion that it is increased Mα2 cell activity that drives the observed changes in behavior and cortical activity.

    1. eLife Assessment

      This valuable study shows that regions of the human auditory cortex that respond strongly to voices are also sensitive to vocalizations from closely related primate species. The study is methodologically solid, though additional analyses - particularly those isolating the acoustic features that differentiate chimpanzee from bonobo calls - would further strengthen the conclusions. With additional analyses and discussions, the work has the potential to offer key insights into the evolutionary continuity of voice processing and would be of interest to researchers studying auditory processing and evolutionary neuroscience in general.

    2. Reviewer #1 (Public review):

      Summary:

      This study investigates how human temporal voice areas (TVA) respond to vocalizations from nonhuman primates. Using functional MRI during a species-categorization task, the authors compare neural responses to calls from humans, chimpanzees, bonobos, and macaques while modeling both acoustic and phylogenetic factors. They find that bilateral anterior TVA regions respond more strongly to chimpanzee than to other nonhuman primate vocalizations, suggesting that these regions are sensitive not only to human voices but also to acoustically and evolutionarily related sounds.

      The work provides important comparative evidence for continuity in primate vocal communication and offers a strong empirical foundation for modeling how specific acoustic features drive TVA activity.

      Strengths:

      (1) Comparative scope: The inclusion of four primate species, including both great apes and monkeys, provides a rare and valuable cross-species perspective on voice processing.

      (2) Methodological rigor: Acoustic and phylogenetic distances are carefully quantified and incorporated into the analyses.

      (3) Neuroscientific significance: The finding of TVA sensitivity to chimpanzee calls supports the view that human voice-selective regions are evolutionarily tuned to certain acoustic features shared across primates.

      (4) Clear presentation: The study is well organized, the stimuli well controlled, and the imaging analyses transparent and replicable.

      (5) Theoretical contribution: The results advance understanding of the neural bases of voice perception and the evolutionary roots of voice sensitivity in the human brain.

      Weaknesses:

      (1) Acoustic-phylogenetic confound: The design does not fully disentangle acoustic similarity from phylogenetic proximity, as species co-vary along both dimensions. A promising way to address this would be to include an additional model focusing on the acoustic features that specifically differentiate bonobo from chimpanzee calls, which share equal phylogenetic distance to humans.

      (2) Selectivity vs. sensitivity: Without non-vocal control sounds, the study cannot determine whether TVA responses reflect true selectivity for primate vocalizations or general auditory sensitivity.

      (3) Task demands: The use of an active categorization task may engage additional cognitive processes beyond auditory perception; a passive listening condition would help clarify the contribution of attention and task performance.

      (4) Figures and presentation: Some results are partially redundant; keeping only the most representative model figure in the main text and moving others to the Supplementary Material would improve clarity.

    3. Reviewer #2 (Public review):

      Summary:

      This study investigated how the human brain responds to vocalizations from multiple primate species, including humans, chimpanzees, bonobos, and rhesus macaques. The central finding - that subregions of the temporal voice areas (TVA), particularly in the bilateral anterior superior temporal gyrus, show enhanced responses to chimpanzee vocalizations - suggests a potential neural sensitivity to calls from phylogenetically close nonhuman primates.

      Strengths:

      The authors employed three analytical models to consistently demonstrate activation in the anterior superior temporal gyrus that is specific to chimpanzee calls. The methodology was logical and robust, and the results supporting these findings appear solid.

      Weakness:

      The interpretation of the findings in this paper regarding the evolutionary continuity of voice processing lacks sufficient evidence. A simple explanation is that the observed effects can be attributed to the similarity in low-level acoustic features, rather than effects specific to phylogenetically close species. The authors only tested vocalizations from three non-human primate species, other than humans. In this case, the species specificity of the effect does not fully represent the specificity of evolutionary relatedness.

    4. Reviewer #3 (Public review):

      Summary:

      Ceravolo et al. employed functional magnetic resonance imaging (fMRI) to examine how the temporal voice areas (TVA) in the human brain respond to vocalizations from different nonhuman primate species. Their findings reveal that the human TVA is not only responsive to human vocalizations but also exhibits sensitivity to the vocalizations of other primates, particularly chimpanzee vocalizations sharing acoustic similarities with human voices, which offers compelling evidence for cross-species vocal processing in the human auditory system. Overall, the study presents intellectually stimulating hypotheses and demonstrates methodological originality. However, the current findings are not yet solid enough to fully support the proposed claims, and the presentation could be enhanced for clarity and impact.

      Strengths:

      The study presents intellectually stimulating hypotheses and demonstrates methodological originality.

      Weaknesses:

      (1) The analysis of the fMRI data does not account for the participants' behavioral performance, specifically their reaction times (RTs) during the species categorization task.

      (2) The figure organization/presentation requires significant revision to avoid confusion and redundancy.

    1. eLife Assessment

      This valuable simulation study proposes a new coarse-grained model to explain the effects of CpG methylation on nucleosome wrapping energy. The model accurately reproduces the all-atom molecular dynamics simulation data, and the evidence to support the claims in the paper is solid. This work will be of interest to researchers working on gene regulation, mechanisms of DNA methylation and effects of DNA methylation on nucleosome positioning.

    2. Reviewer #1 (Public review):

      In this manuscript, the authors used a coarse-grained DNA model (cgNA+) to explore how DNA sequences and CpG methylation/hydroxymethylation influence nucleosome wrapping energy and the probability density of optimal nucleosomal configuration. Their findings indicate that both methylated and hydroxymethylated cytosines lead to increased nucleosome wrapping energy. Additionally, the study demonstrates that methylation of CpG islands increases the probability of nucleosome formation.

      The major strength of this method is that the model explicitly includes the phosphate group as DNA-histone binding site constraints, enhancing CG model accuracy and computational efficiency and allowing comprehensive calculations of DNA mechanical properties and deformation energies.

      The revised version has addressed the concerns raised previously, significantly strengthening the study.

    3. Reviewer #2 (Public review):

      Summary:

      This study uses a coarse-grained model for double stranded DNA, cgNA+, to assess nucleosome sequence affinity. cgNA+ coarse-grains DNA on the level of bases and accounts also explicitly for the positions of the backbone phosphates. It has been proven to reproduce all-atom MD data very accurately. It is also ideally suited to be incorporated into a nucleosome model because it is known that DNA is bound to the protein core of the nucleosome via the phosphates.

      It is still unclear whether this harmonic model parametrized for unbound DNA is accurate enough to describe DNA inside the nucleosome. Previous models by other authors, using more coarse-grained models of DNA, have been rather successful in predicting base pair sequence dependent nucleosome behavior. This is at least the case as long as DNA shape is concerned whereas assessing the role of DNA bendability (something this paper focuses on) has been consistently challenging in all nucleosome models to my knowledge.

      It is thus of major interest whether this more sophisticated model is also more successful in handling this issue. As far as I can tell the work is technically sound and properly accounts for not only the energy required in wrapping DNA but also entropic effects, namely the change in entropy that DNA experiences when going from the free state to the bound state. The authors make an approximation here which seems to me to be a reasonable first step.

      Of interest is also that the authors have the parameters at hand to study the effect of methylation of CpG-steps. This is especially interesting as this allows one to study a scenario where changes in the physical properties of base pair steps via methylation might influence nucleosome positioning and stability in a cell-type specific way.

      Overall, this is an important contribution to the questions of how sequence affects nucleosome positioning and affinity. The findings suggest that cgNA+ has something new to offer. But the problem is complex, also on the experimental side, so many questions remain open. Despite this, I highly recommend publication of this manuscript.

      Strengths:

      The authors use their state-of-the-art coarse grained DNA model which seems ideally suited to be applied to nucleosomes as it accounts explicitly for the backbone phosphates.

      Weaknesses:

      The authors introduce penalty coefficients c_i to avoid steric clashes between the two DNA turns in the nucleosome. This requires c_i-values that are so high that standard deviations in the fluctuations of the simulation are smaller than in the experiments.

    4. Reviewer #3 (Public review):

      Summary:

      In this study, authors utilize biophysical modeling to investigate differences in free energies and nucleosomal configuration probability density of CpG islands and nonmethylated regions in the genome. Toward this goal, they develop and apply the cgNA+ coarse-grained model, an extension of their prior molecular modeling framework.

      Strengths:

      The study utilizes biophysical modeling to gain mechanistic insight into nucleosomal occupancy differences in CpG and nonmethylated regions in the genome.

      Weaknesses:

      Although the overall study is interesting, the manuscript needs more clarity in places. Moreover, the rationale and conclusion for some of the analyses are not well described.

      Comments on revised version:

      The authors have addressed my concerns.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      In this manuscript, the authors used a coarse-grained DNA model (cgNA+) to explore how DNA sequences and CpG methylation/hydroxymethylation influence nucleosome wrapping energy and the probability density of optimal nucleosomal configuration. Their findings indicate that both methylated and hydroxymethylated cytosines lead to increased nucleosome wrapping energy. Additionally, the study demonstrates that methylation of CpG islands increases the probability of nucleosome formation.

      Strengths:

      The major strength of this method is that the model explicitly includes the phosphate group as DNA-histone binding site constraints, enhancing CG model accuracy and computational efficiency and allowing comprehensive calculations of DNA mechanical properties and deformation energies.

      Weaknesses:

      A significant limitation of this study is that the parameter sets for the methylated and hydroxymethylated CpG steps in the cgNA+ model are derived from all-atom molecular dynamics (MD) simulations that use previously established force field parameters for modified cytosines (Pérez A, et al. Biophys J. 2012; Battistini, et al. PLOS Comput Biol. 2021). These parameters suggest that both methylated and hydroxymethylated cytosines increase DNA stiffness and nucleosome wrapping energy, which could predispose the coarse-grained model to replicate these findings. Notably, conflicting results from other all-atom MD simulations, such as those by Ngo T, et al. (Nat. Commun. 2016), show that hydroxymethylated cytosines increase DNA flexibility, contrary to methylated cytosines. If the cgNA+ model were trained on these latter parameters or other all-atom MD force fields, different conclusions might be obtained regarding the effects of methylation and hydroxymethylation on nucleosome formation.

      Despite the training parameters of the cgNA+ model, the results presented in the manuscript indicate that methylated cytosines increase both DNA stiffness and nucleosome wrapping energy. However, when comparing nucleosome occupancy scores with predicted nucleosome wrapping energies and optimal configurations, the authors find that methylated CGIs exhibit higher nucleosome occupancies than unmethylated ones, which seems to contradict the expected relationship where increased stiffness should reduce nucleosome formation affinity. In the manuscript, the authors also admit that these conclusions “apparently runs counter to the (perhaps naive) intuition that high nucleosome forming affinity should arise for fragments with low wrapping energy”. Previous all-atom MD simulations (Pérez A, et al. Biophys J. 2012; Battistini, et al. PLOS Comput Biol. 2021; Ngo T, et al. Nat. Commun. 2016) show that the stiffer DNA upon CpG methylation reduces the affinity of DNA to assemble into nucleosomes or destabilizes nucleosomes. Given these findings, the authors need to address and reconcile these seemingly contradictory results, as the influence of epigenetic modifications on DNA mechanical properties and nucleosome formation are critical aspects of their study.

      Understanding the influence of sequence-dependent and epigenetic modifications of DNA on mechanical properties and nucleosome formation is crucial for comprehending various cellular processes. The authors’ study, focusing on these aspects, definitely will garner interest from the DNA methylation research community.

      Training the cgNA+ model on alternative MD simulation datasets is certainly of interest to us. However, due to the significant computational cost, this remains a goal for future work. The relationship between nucleosome occupancy scores and nucleosome wrapping energy is still debated, as noted in our Discussion section. The conflicting results may reflect differences in experimental conditions and the contribution of cellular factors other than DNA mechanics to nucleosome formation in vivo. For instance, Pérez et al. (2012), Battistini et al. (2021), and Ngo et al. (2016) concluded that DNA methylation reduces nucleosome formation based on experiments with modified Widom 601 sequences. In contrast, the genome-wide methylation study by Collings and Anderson (2017) found the opposite effect. In our work, we also use whole-genome nucleosome occupancy data.

      Comments on revised version:

      The authors have addressed most of my comments and concerns regarding this manuscript.

      Reviewer #2 (Public Review):

      Summary:

      This study uses a coarse-grained model for double stranded DNA, cgNA+, to assess nucleosome sequence affinity. cgNA+ coarse-grains DNA on the level of bases and accounts also explicitly for the positions of the backbone phosphates. It has been proven to reproduce all-atom MD data very accurately. It is also ideally suited to be incorporated into a nucleosome model because it is known that DNA is bound to the protein core of the nucleosome via the phosphates.

      It is still unclear whether this harmonic model parametrized for unbound DNA is accurate enough to describe DNA inside the nucleosome. Previous models by other authors, using more coarse-grained models of DNA, have been rather successful in predicting base pair sequence dependent nucleosome behavior. This is at least the case as long as DNA shape is concerned whereas assessing the role of DNA bendability (something this paper focuses on) has been consistently challenging in all nucleosome models to my knowledge.

      It is thus of major interest whether this more sophisticated model is also more successful in handling this issue. As far as I can tell the work is technically sound and properly accounts for not only the energy required in wrapping DNA but also entropic effects, namely the change in entropy that DNA experiences when going from the free state to the bound state. The authors make an approximation here which seems to me to be a reasonable first step.

      Of interest is also that the authors have the parameters at hand to study the effect of methylation of CpG-steps. This is especially interesting as this allows one to study a scenario where changes in the physical properties of base pair steps via methylation might influence nucleosome positioning and stability in a cell-type specific way.

      Overall, this is an important contribution to the questions of how sequence affects nucleosome positioning and affinity. The findings suggest that cgNA+ has something new to offer. But the problem is complex, also on the experimental side, so many questions remain open. Despite this, I highly recommend publication of this manuscript.

      Strengths:

      The authors use their state-of-the-art coarse grained DNA model which seems ideally suited to be applied to nucleosomes as it accounts explicitly for the backbone phosphates.

      Weaknesses:

      The authors introduce penalty coefficients c_i to avoid steric clashes between the two DNA turns in the nucleosome. This requires c_i-values that are so high that standard deviations in the fluctuations of the simulation are smaller than in the experiments.

      Indeed, smaller c_i values lead to steric clashes between the two turns of DNA. A possible improvement of our optimisation method and a direction of future work would be adding a penalty which prevents steric clashes to the objective function. Then the c_i values could be reduced to have bigger fluctuations that are even closer to the experimental structures.

      Reviewer #3 (Public Review):

      Summary:

      In this study, authors utilize biophysical modeling to investigate differences in free energies and nucleosomal configuration probability density of CpG islands and nonmethylated regions in the genome. Toward this goal, they develop and apply the cgNA+ coarse-grained model, an extension of their prior molecular modeling framework.

      Strengths:

      The study utilizes biophysical modeling to gain mechanistic insight into nucleosomal occupancy differences in CpG and nonmethylated regions in the genome.

      Weaknesses:

      Although the overall study is interesting, the manuscript needs more clarity in places. Moreover, the rationale and conclusion for some of the analyses are not well described.

      We have revised the manuscript in accordance with the reviewer’s latest suggestions.

      Comments on revised version:

      Authors have attempted to address previously raised concerns.

      Reviewer #1 (Recommendations for the authors):

      The authors have addressed most of my comments and concerns regarding this manuscript. Among them, the most significant pertains to fitting the coarse-grained model using a different all-atom force field to verify the conclusions. The authors acknowledged this point but noted the computational cost involved and proposed it as a direction for future work. Overall, I recommend the revised version for publication.

      Reviewer #2 (Recommendations for the authors):

      My previous comments were addressed satisfactorily.

      Reviewer #3 (Recommendations for the authors):

      Authors have attempted to address previously raised concerns. However, some concerns listed below remain that need to be addressed.

      (1) The first reviewer makes a valid point regarding the reconciliation of conflicting observations related to nucleosome-forming affinity and wrapping energy. Unfortunately, the authors don’t seem to address this and state that this will be the goal for the future study.

      Training the cgNA+ model on alternative MD simulation datasets remains future work. However, we revised the Discussion section to more clearly address the conflicting experimental findings in the literature on how DNA methylation influences nucleosome formation.

      (2) Please report the effect size and statistical significance value for Figures 7 and 8, as this information is currently not provided, despite the authors’ claim that these observations are statistically significant.

      This information is now presented in Supplementary Tables S1-S4.

      (3) In response to the discrepancy in cell lines for correlating nucleosome occupancy and methylation analyses, the authors claim that there is no publicly available nucleosome occupancy and methylation data for a human cell type within the human genome. This claim is confusing, as the GM12878 cell line has been extensively characterized with MNase-seq and WGBS.

      We thank the reviewer for this remark. We have removed the statement regarding the lack of data from the manuscript; we intend to examine the suggested cell line in future research.

      (4) In response to my question, the authors claimed that they selected regions from chromosome 1 exclusively; however, the observation remains unchanged when considering sequence samples from different genomic regions. They should provide examples from different chromosomes as part of the supplementary information to further support this.

      The examples of corresponding plots for other nucleosomes are now shown in Supplementary Figure S9.

    1. eLife Assessment

      This useful study identifies knowledge of letter shape as a distinct component of letter knowledge and shows that children acquire it even before formal reading instruction and without knowing the corresponding letter sounds. However, the evidence supporting the main conclusions is incomplete at the current stage. With additional analyses examining the relationships among the underlying variables and/or revising interpretations, the work would be of broad interest to researchers studying language and vision.

    2. Reviewer #1 (Public review):

      Summary:

      This study examines letter-shape knowledge in a large cohort of children with minimal formal reading instruction. The authors report that these children can reliably distinguish upright from inverted letters despite limited letter naming abilities. They also show a visual-search advantage for upright over inverted letters, and this advantage correlates with letter-shape familiarity. These findings suggest that specialized letter-shape representations can emerge with very limited letter-sound mapping practice.

      Strengths:

      This study investigates whether children can develop letter-shape knowledge independently of letter-sound mapping abilities. This question is theoretically important, especially in light of functional subdivisions within the visual word form area (VWFA), with posterior regions associated with letter/orthographic shape and anterior regions with linguistic features of orthography (Caffarra et al., 2021; Lerma-Usabiaga et al., 2018). The study also includes a large sample of children at the very beginning of formal reading instruction, thereby minimizing the influence of explicit instruction on the formation of letter-shape knowledge.

      Weakness:

      A central concern is that a production task (naming) is used to index letter-name knowledge, whereas letter-shape knowledge is assessed with recognition. Production tasks impose additional demands (motor planning, articulation) and typically yield lower performance than recognition tasks (e.g., letter-sound verification). Thus, comparisons between letter-shape and letter-name knowledge are confounded by task type. The authors' partial-correlation and multiple-regression analyses linking familiarity (but not production) to the upright-search advantage are informative; however, they do not resolve the recognition-versus-production mismatch. Consequently, the current data cannot unambiguously support the claim that letter-shape representations are independent of letter-name knowledge.

    3. Reviewer #2 (Public review):

      Summary:

      In this study, the authors propose that there are two types of letter knowledge: knowledge about letter sound and knowledge about letter shape. Based on previous studies on implicit statistical learning in adults and babies, the authors hypothesized that passive exposure to letters in the environment allows early readers to acquire knowledge of letter shapes even before knowledge of letter-sound association. Children performed a set of experiments that measures letter shape familiarity, letter-sound association performance, visual processing of letters, and a reading-related cognitive skill. The results show that even the children who have little to no knowledge of letter names are familiar with letter shapes, and that this letter shape familiarity is predictive of performance in visual processing of letters.

      Strengths:

      The authors' hypothesis is based on widely accepted findings in vision science that repeated exposure to certain stimuli promotes implicit learning of, for example, statistical properties of the stimuli. They used simple and well-established tasks in large-scale experiments with a special population (i.e., children). The data analysis is quite comprehensive, accounting for alternative explanations when needed. The data support at least a part of their hypothesis that the knowledge of letter shapes is distinct from, and precedes, the knowledge of letter-sound association, and is associated with performance in visual processing of the letters. This study sheds light on a rather overlooked aspect of letter knowledge, i.e., letter shapes, challenging the idea that letters are learned only through formal instruction and calling for future research on the role of passive exposure to letters in reading acquisition.

      Weaknesses:

      Although the authors have successfully identified the knowledge of letter shapes as another type of letter knowledge other than the knowledge of letter-sound association, the question of whether it drives the subsequent reading acquisition remains largely unanswered, despite it being strongly implied in the Introduction. The authors collected a RAN score, which is known to robustly predict future reading fluency, but it did not show a significant partial correlation with familiarity accuracy (i.e., familiarity accuracy is not necessary to predict RAN score). The authors discussed that the performance in visual processing of letters might capture unique variance in reading fluency unexplained by RAN scores, but currently, this claim seems speculative.

      Since even children without formal literacy instruction were highly familiar with letter shapes, it would be reasonable to assume that they had obtained the knowledge through passive exposure. However, the role of passive exposure was not directly tested in the study.

      Given the superimposed straight lines in Figure 2, I assume the authors computed Pearson correlation coefficients. Testing the statistical significance of the Pearson correlation coefficient requires the assumption of bivariate normality (and therefore constant variance of a variable across the range of the other). According to Figure 2, this doesn't seem to be met, as the familiarity accuracy is hitting the ceiling. The ceiling effect might not be critical in Figure 2, since it tends to attenuate correlation, not inflate it. But in Figures 3 and 4, the authors' conclusion depends on the non-significant partial correlation. In fact, the authors themselves wrote that the ceiling effect might lead to a non-significant correlation even if there is an actual effect (line 404).
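
      The attenuation argument above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the variables are synthetic bivariate-normal draws, the true correlation (0.5), the ceiling position, and the sample size are arbitrary illustrative choices, and none of it is data from the study. Capping one variable at a ceiling leaves the underlying relationship intact but lowers the observed Pearson coefficient.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_ceiling(n=5000, rho=0.5, ceiling=0.0, seed=0):
    """Return (r without ceiling, r with ceiling) for synthetic data."""
    rng = random.Random(seed)
    latent, outcome = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        latent.append(z1)
        # Construct a second variable with population correlation rho.
        outcome.append(rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    # Scores above the ceiling are recorded as the ceiling value.
    observed = [min(value, ceiling) for value in latent]
    return pearson_r(latent, outcome), pearson_r(observed, outcome)


r_latent, r_ceiling = simulate_ceiling()
print(f"r without ceiling: {r_latent:.3f}")
print(f"r with ceiling:    {r_ceiling:.3f}")
```

      Because the ceiling pushes the coefficient toward zero, a non-significant correlation involving a ceiling-limited measure is weaker evidence for the absence of an effect than it would otherwise be.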

    4. Reviewer #3 (Public review):

      Summary:

      This study examined how young children with minimal reading instruction process letters, focusing on their familiarity with letter shapes, knowledge of letter names, and visual discrimination of upright versus inverted letters. Across four experiments, kindergarten and Grade 1 children could identify the correct orientation of letters even without knowing their names.

      Strengths:

      This study addresses an important research gap by examining whether children develop letter familiarity prior to formal literacy instruction and how this skill relates to reading-related cognitive abilities. By emphasizing letter familiarity alongside letter recognition, the study highlights a potentially overlooked yet important component of emergent literacy development.

      Weaknesses:

      The study's methods and results do not effectively test its stated research goals. Reading ability was not directly measured; instead, the authors inferred its relationship with reading from correlations between letter familiarity and reading-related cognitive measures, which limits the validity of their conclusions. Furthermore, the analytical approach was rather limited, relying primarily on simple and partial correlations without employing more advanced statistical methods that could better capture the underlying relationships.

      Major Comments:

      (1) Limited Novelty and Unclear Theoretical Contribution:

      The authors aim to challenge the view that children acquire letter shape knowledge only through formal literacy instruction, but similar questions regarding letter familiarity have already been explored in previous research. The manuscript does not clearly articulate how the present study advances beyond existing findings or why examining letter familiarity specifically before formal instruction provides new theoretical insight. Moreover, if letter familiarity and letter recognition are treated as distinct constructs, the authors should better justify their differentiation and clarify the theoretical significance of focusing on familiarity as an independent component of emergent literacy.

      (2) Overgeneralization to Reading Ability:

      Although the study measured several literacy-related cognitive skills and examined correlations with letter familiarity, it did not directly assess children's reading ability, as participants had not yet received formal literacy instruction. Therefore, the conclusion that letter familiarity influences reading skills (e.g., Line 519: "Our results are broadly consistent with previous work that has highlighted print letter knowledge as a strong predictor of future reading skills") is not fully supported and should be clarified or revised. To draw conclusions about the impact on reading ability, a longitudinal study would be more appropriate, assessing the relationship between letter familiarity and reading skills after children have received formal literacy instruction. If a longitudinal study is not feasible, measuring familial risk for dyslexia could provide an alternative approach to infer the potential influence of letter familiarity on later reading development.

      (3) Confusing and Limited Analytical Approach with Potential for More Sophisticated Modeling:

      The study employs a confusing analytical approach, alternating between simple correlational analyses and group-based comparisons, which may introduce circularity - for example, defining high vs. low familiarity groups partly based on performance differences in upright versus inverted letters and then observing a visual search advantage for upright letters within these groups. Moreover, the analyses are relatively simple: although multiple linear regression is mentioned, the results are not fully reported. These approaches may not fully capture the complex relationships among letter familiarity, recognition, visual search performance, RAN, and other covariates. More sophisticated modeling, such as mixed-effects models to account for repeated measures, structural equation modeling to examine latent constructs, or multivariate approaches jointly modeling familiarity and recognition effects, could provide a clearer understanding of the unique contribution of letter shape familiarity to early literacy outcomes. In addition, a large number of correlations were conducted without correction for multiple comparisons, which may increase the risk of false positives and raise concerns about the reliability of some significant findings.
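
      As an illustration of the multiple-comparisons point, the sketch below applies the Benjamini-Hochberg procedure, one common correction among several (the comment above does not name a specific method, so this choice is an assumption). The p-values are made up for the example and are not taken from the manuscript.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by m / rank.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working from the largest p-value downward.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


# Hypothetical p-values from four separate correlation tests.
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)
n_raw = int(np.sum(np.asarray(raw) < 0.05))
n_adj = int(np.sum(adj < 0.05))
print(f"significant before correction: {n_raw}, after: {n_adj}")
```

      In this toy case three tests fall below 0.05 before correction and one after, which shows how uncorrected testing can overstate the number of reliable findings.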

    1. eLife Assessment

      This important work develops a new protocol to experimentally perturb target genes across a quantitative range of expression levels in cell lines. The evidence supporting their new perturbation approach is convincing, and we propose that focusing on a single modality (activation or inhibition) would be sufficient to draw their conclusions. The study will be of broad interest to scientists in the fields of functional genomics and biotechnology.

    2. Reviewer #1 (Public review):

      In this manuscript, Domingo et al. present a novel perturbation-based approach to experimentally modulate the dosage of genes in cell lines. Their approach is capable of gradually increasing and decreasing gene expression. The authors then use their approach to perturb three key transcription factors and measure the downstream effects on gene expression. Their analysis of the dosage response curve of downstream genes reveals marked non-linearity.

      One of the strengths of this study is that many of the perturbations fall within the physiological range for each cis gene. This range is presumably between a single-copy state of heterozygous loss-of-function (log fold change of -1) and a three-copy state (log fold change of ~0.6). This is in contrast with CRISPRi or CRISPRa studies that attempt to maximize the effect of the perturbation, which may result in downstream effects that are not representative of physiological responses.
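
      The two bounds quoted above follow from simple arithmetic if one assumes that the fold changes are on a log2 scale and that expression scales proportionally with copy number relative to a two-copy baseline. Both assumptions are ours, made for illustration:

```python
import math


def log2_fold_change(copies, baseline_copies=2):
    """log2 ratio of expression to baseline, assuming expression is
    proportional to gene copy number (a simplifying assumption)."""
    return math.log2(copies / baseline_copies)


single_copy = log2_fold_change(1)  # heterozygous loss-of-function
three_copy = log2_fold_change(3)   # one extra copy
print(f"single copy: {single_copy:.2f}, three copies: {three_copy:.2f}")
```

      Under these assumptions, the single-copy state gives exactly -1 and the three-copy state gives about 0.58, consistent with the approximate value of 0.6 stated above.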

      Another strength of the study is that various points along the dosage-response curve were assayed for each perturbed gene. This allowed the authors to effectively characterize the degree of linearity and monotonicity of each dosage-response relationship. Ultimately, the study revealed that many of these relationships are non-linear, and that the response to activation can be dramatically different than the response to inhibition.

      To test their ability to gradually modulate dosage, the authors chose to measure three transcription factors and around 80 known downstream targets. As the authors themselves point out in their discussion about MYB, this biased sample of genes makes it unclear how this approach would generalize genome-wide. In addition, the data generated from this small sample of genes may not represent genome-wide patterns of dosage response. Nevertheless, this unique data set and approach represent a first step in understanding dosage-response relationships between genes.

      Another point of general concern in such screens is the use of the immortalized K562 cell line. It is unclear how the biology of these cell lines translates to the in vivo biology of primary cells. However, the authors do follow up with cell-type-specific analyses (Figures 4B, 4C, and 5A) to draw correspondence between their perturbation results and the relevant biology in primary cells and complex diseases.

      The conclusions of the study are generally well supported with statistical analysis throughout the manuscript. As an example, the authors utilize well-known model selection methods to identify when there was evidence for non-linear dosage response relationships.
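
      To make the idea of model selection for non-linearity concrete, here is a minimal sketch that compares a linear and a quadratic fit with the Akaike information criterion (AIC). It is not the authors' procedure: the choice of AIC, the polynomial models, and the synthetic saturating response are all assumptions made for the example.

```python
import numpy as np


def aic_least_squares(y, y_hat, n_params):
    """AIC for a least-squares fit with Gaussian errors (up to a constant)."""
    n = len(y)
    rss = float(np.sum((y - y_hat) ** 2))
    return float(n * np.log(rss / n) + 2 * n_params)


def compare_linear_vs_quadratic(x, y):
    """Fit degree-1 and degree-2 polynomials and return their AIC values."""
    aics = {}
    for name, degree in (("linear", 1), ("quadratic", 2)):
        coeffs = np.polyfit(x, y, degree)
        aics[name] = aic_least_squares(y, np.polyval(coeffs, x), degree + 1)
    return aics


# Synthetic dosage-response data: a saturating trans response plus noise.
rng = np.random.default_rng(1)
cis_lfc = np.linspace(-1.0, 0.6, 20)
trans_lfc = 1.0 - np.exp(-2.0 * (cis_lfc + 1.0)) + rng.normal(0.0, 0.02, 20)

aics = compare_linear_vs_quadratic(cis_lfc, trans_lfc)
best = min(aics, key=aics.get)
print(aics, "preferred:", best)
```

      The model with the lower AIC is preferred; for this strongly curved synthetic response the quadratic model wins despite its extra parameter.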

      Gradual modulation of gene dosage is a useful approach to model physiological variation in dosage. Experimental perturbation screens that use CRISPR inhibition or activation often use guide RNAs targeting the transcription start site to maximize their effect on gene expression. Generating a physiological range of variation will allow others to better model physiological conditions.

      There is broad interest in the field to identify gene regulatory networks using experimental perturbation approaches. The data from this study provides a good resource for such analytical approaches, especially since both inhibition and activation were tested. In addition, these data provide a nuanced, continuous representation of the relationship between effectors and downstream targets, which may play a role in the development of more rigorous regulatory networks.

      Human geneticists often focus on loss-of-function variants, which represent natural knock-down experiments, to determine the role of a gene in the biology of a trait. This study demonstrates that dosage response relationships are often non-linear, meaning that the effect of a loss-of-function variant may not necessarily carry information about increases in gene dosage. For the field, this implies that others should continue to focus on both inhibition and activation to fully characterize the relationship between gene and trait.

      Comments on revisions:

      Thank you for responding to our comments. We have no further comments for the authors.

    3. Reviewer #2 (Public review):

      Summary:

      This work investigates transcriptional responses to varying levels of transcription factors (TFs). The authors aim for gradual up- and down-regulation of three transcription factors GFI1B, NFE2 and MYB in K562 cells, by using a CRISPRa- and a CRISPRi line, together with sgRNAs of varying potency. Targeted single-cell RNA sequencing is then used to measure gene expression of a set of 90 genes, which were previously shown to be downstream of GFI1B and NFE2 regulation. This is followed by an extensive computational analysis of the scRNA-seq dataset. By grouping cells with the same perturbations, the authors can obtain groups of cells with varying average TF expression levels. The achieved perturbations are generally subtle, not reaching half or double doses for most samples, and up-regulation is generally weak below 1.5-fold in most cases. Even in this small range, many target genes exhibit a non-linear response. Since this is rather unexpected, it is crucial to rule out technical reasons for these observations.

      Strengths:

      The work showcases how a single dataset of CRISPRi/a perturbations with scRNA-seq readout and an extended computational analysis can be used to estimate transcriptome dose-responses, a general approach that likely can be built upon in the future.

      Moreover, the authors highlight tiling of sgRNAs +/- 1000 bp around the TSS as a useful approach. Compared with conventional direct TSS-targeting (+/- 200 bp), the larger sequence window allows placing more sgRNAs. It also requires little prior knowledge of CREs and avoids using "attenuated" sgRNAs, which would require specialized sgRNA design.

      Weaknesses:

      The experiment was performed in a single replicate, and it would have been reassuring to see an independent validation of the main findings, for example through measuring individual dose-response curves.

      Much of the analysis depends on the estimation of log-fold changes between groups of single cells with non-targeting controls and those carrying a guide RNA driving a specific knockdown. Generally, biological replicates are recommended for differential gene expression testing (Squair et al. 2021, https://doi.org/10.1038/s41467-021-25960-2). When using the FindMarkers function from the Seurat package, the authors deviate from the recommendations for pseudo-bulk analysis to aggregate the raw counts (https://satijalab.org/seurat/articles/de_vignette.html). Furthermore, differential gene expression analysis of scRNA-seq data can suffer from mis-estimations (Nguyen et al. 2023, https://doi.org/10.1038/s41467-023-37126-3), and different computational tools or versions can affect these estimates strongly (Pullin et al. 2024, https://doi.org/10.1186/s13059-024-03183-0 and Rich et al. 2024, https://doi.org/10.1101/2024.04.04.588111). Therefore, it would be important to describe more precisely in the Methods how this analysis was performed, any deviations from default parameters, package versions, and at which point which values were aggregated to form "pseudobulk" samples.
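
      For readers unfamiliar with the distinction, the sketch below shows one simple way to compute a pseudobulk log2 fold change by first aggregating raw counts per group. It is a generic illustration, not the authors' pipeline and not Seurat's implementation; the function name, the counts-per-million scaling, and the pseudocount are assumptions.

```python
import numpy as np


def pseudobulk_log2fc(counts, is_perturbed, gene_idx,
                      pseudocount=1.0, scale=1e6):
    """log2 fold change of one gene between perturbed and control cells.

    counts: cells x genes matrix of raw counts.
    is_perturbed: boolean per cell (True = guide, False = non-targeting).
    """
    counts = np.asarray(counts, dtype=float)
    mask = np.asarray(is_perturbed, dtype=bool)

    def cpm(group):
        # Sum raw counts across cells first, then normalize by library size,
        # so cells with zero counts (dropouts) still contribute to the total.
        summed = group.sum(axis=0)
        return summed / summed.sum() * scale

    pert = cpm(counts[mask])
    ctrl = cpm(counts[~mask])
    ratio = (pert[gene_idx] + pseudocount) / (ctrl[gene_idx] + pseudocount)
    return float(np.log2(ratio))


# Toy data: two control and two perturbed cells, two genes.
toy_counts = [[10, 90], [10, 90], [5, 95], [5, 95]]
toy_labels = [False, False, True, True]
lfc = pseudobulk_log2fc(toy_counts, toy_labels, gene_idx=0)
print(f"log2 fold change of gene 0: {lfc:.3f}")
```

      In the toy data the first gene's share of counts is halved in the perturbed cells, so the estimate is close to -1. The point at which aggregation happens is exactly the detail the Methods should state.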

      Two different cell lines are used to construct dose-response curves, where a CRISPRi line allows gene down-regulation and the CRISPRa line allows gene upregulation. Although both lines are derived from the same parental line (K562), the expression analysis of Tet2, which is absent in the CRISPRi line, but expressed in the CRISPRa line (Fig. S1F, S3A) suggests clonal differences between the two lines. Similarly, the UMAP in S3C and the PCA in S4A suggest batch effects between the two lines. These might confound this analysis, even though all fold changes are calculated relative to the baseline expression in the respective cell line (NTC cells). Combining log2-fold changes from the two cell lines with different baseline expression into a single curve (e.g. Fig. 3) remains misleading, because different data points could be normalized to different baseline expression levels.

      The study estimates the relationship between TF dose and target gene expression. This requires a system that allows quantitative changes in TF expression. The data provided does not convincingly show that this condition is met, which however is an essential prerequisite for the presented conclusions. Specifically, the data shown in Fig. S3A shows that upon stronger knock-down, a subpopulation of cells appears in which the targeted TF is no longer detected (drop-outs). Fig. 3B (top) also suggests that the knock-down is either subtle (similar to NTCs) or strong, but intermediate knock-down (log2-FC of 0.5-1) does not occur. Although the authors argue that this is a technical effect of the scRNA-seq protocol, it is also possible that this represents a binary behavior of the CRISPRi system. Previous work has shown that CRISPRi systems with the KRAB domain largely result in binary repression and not in gradual down-regulation as suggested in this study (Bintu et al. 2016 (https://doi.org/10.1126/science.aab2956), Noviello et al. 2023 (https://doi.org/10.1038/s41467-023-38909-4)).

      One of the major conclusions of the study is that non-linear behavior is common. It would be helpful to show that this observation does not arise from the technical concerns described in the previous points. This could be done for instance with independent experimental validations.

      Did the authors achieve their aims? Do the results support the conclusions?:

      Some of the most important conclusions, such as the claim that non-linear responses are common, are not well supported because they rely on accurately determining the quantitative responses of trans genes, which suffers from the previously mentioned concerns.

      Discussion of the likely impact of the work on the field, and the utility of the methods and data to the community:

      Together with other recent publications, this work emphasizes the need to study transcription factor function with quantitative perturbations. The computational code repository contains all the valuable code with inline comments, but would have benefited from a readme file explaining the repository structure, package versions, and instructions to reproduce the analyses, including which input files or directory structure would be needed.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      In this manuscript, Domingo et al. present a novel perturbation-based approach to experimentally modulate the dosage of genes in cell lines. Their approach is capable of gradually increasing and decreasing gene expression. The authors then use their approach to perturb three key transcription factors and measure the downstream effects on gene expression. Their analysis of the dosage response curve of downstream genes reveals marked non-linearity.

      One of the strengths of this study is that many of the perturbations fall within the physiological range for each cis gene. This range is presumably between a single-copy state of heterozygous loss-of-function (log fold change of -1) and a three-copy state (log fold change of ~0.6). This is in contrast with CRISPRi or CRISPRa studies that attempt to maximize the effect of the perturbation, which may result in downstream effects that are not representative of physiological responses.

      Another strength of the study is that various points along the dosage-response curve were assayed for each perturbed gene. This allowed the authors to effectively characterize the degree of linearity and monotonicity of each dosage-response relationship. Ultimately, the study revealed that many of these relationships are non-linear, and that the response to activation can be dramatically different than the response to inhibition.

      To test their ability to gradually modulate dosage, the authors chose to measure three transcription factors and around 80 known downstream targets. As the authors themselves point out in their discussion about MYB, this biased sample of genes makes it unclear how this approach would generalize genome-wide. In addition, the data generated from this small sample of genes may not represent genome-wide patterns of dosage response. Nevertheless, this unique data set and approach represent a first step in understanding dosage-response relationships between genes.

      Another point of general concern in such screens is the use of the immortalized K562 cell line. It is unclear how the biology of these cell lines translates to the in vivo biology of primary cells. However, the authors do follow up with cell-type-specific analyses (Figures 4B, 4C, and 5A) to draw a correspondence between their perturbation results and the relevant biology in primary cells and complex diseases.

      The conclusions of the study are generally well supported with statistical analysis throughout the manuscript. As an example, the authors utilize well-known model selection methods to identify when there was evidence for non-linear dosage response relationships.

      Gradual modulation of gene dosage is a useful approach to model physiological variation in dosage. Experimental perturbation screens that use CRISPR inhibition or activation often use guide RNAs targeting the transcription start site to maximize their effect on gene expression. Generating a physiological range of variation will allow others to better model physiological conditions.

      There is broad interest in the field to identify gene regulatory networks using experimental perturbation approaches. The data from this study provides a good resource for such analytical approaches, especially since both inhibition and activation were tested. In addition, these data provide a nuanced, continuous representation of the relationship between effectors and downstream targets, which may play a role in the development of more rigorous regulatory networks.

      Human geneticists often focus on loss-of-function variants, which represent natural knock-down experiments, to determine the role of a gene in the biology of a trait. This study demonstrates that dosage response relationships are often non-linear, meaning that the effect of a loss-of-function variant may not necessarily carry information about increases in gene dosage. For the field, this implies that others should continue to focus on both inhibition and activation to fully characterize the relationship between gene and trait.

      We thank the reviewer for their thoughtful and thorough evaluation of our study. We appreciate their recognition of the strengths of our approach, particularly the ability to modulate gene dosage within a physiological range and to capture non-linear dosage-response relationships. We also agree with the reviewer’s points regarding the limitations of gene selection and the use of K562 cells, and we are encouraged that the reviewer found our follow-up analyses and statistical framework to be well-supported. We believe this work provides a valuable foundation for future genome-wide applications and more physiologically relevant perturbation studies.

      Reviewer #2 (Public review):

      Summary:

      This work investigates transcriptional responses to varying levels of transcription factors (TFs). The authors aim for gradual up- and down-regulation of three transcription factors GFI1B, NFE2, and MYB in K562 cells, by using a CRISPRa- and a CRISPRi line, together with sgRNAs of varying potency. Targeted single-cell RNA sequencing is then used to measure gene expression of a set of 90 genes, which were previously shown to be downstream of GFI1B and NFE2 regulation. This is followed by an extensive computational analysis of the scRNA-seq dataset. By grouping cells with the same perturbations, the authors can obtain groups of cells with varying average TF expression levels. The achieved perturbations are generally subtle, not reaching half or double doses for most samples, and up-regulation is generally weak below 1.5-fold in most cases. Even in this small range, many target genes exhibit a non-linear response. Since this is rather unexpected, it is crucial to rule out technical reasons for these observations.

      We thank the reviewer for their detailed and thoughtful assessment of our work. We are encouraged by their recognition of the strengths of our study, including the value of quantitative CRISPR-based perturbation coupled with single-cell transcriptomics, and its potential to inform gene regulatory network inference. Below, we address each of the concerns raised:

      Strengths:

      The work showcases how a single dataset of CRISPRi/a perturbations with scRNA-seq readout and an extended computational analysis can be used to estimate transcriptome dose responses, a general approach that likely can be built upon in the future.

      Weaknesses:

      (1) The experiment was only performed in a single replicate. In the absence of an independent validation of the main findings, the robustness of the observations remains unclear.

      We acknowledge that our study was performed in a single pooled experiment. While additional replicates would certainly strengthen the findings, in high-throughput single-cell CRISPR screens, individual cells with the same perturbation serve as effective internal replicates. This is a common practice in the field. Nevertheless, we agree that biological replicates would help control for broader technical or environmental effects.

      (2) The analysis is based on the calculation of log-fold changes between groups of single cells with non-targeting controls and those carrying a guide RNA driving a specific knockdown. How the fold changes were calculated exactly remains unclear, since it is only stated that the FindMarkers function from the Seurat package was used, which is likely not optimal for quantitative estimates. Furthermore, differential gene expression analysis of scRNA-seq data can suffer from data distortion and mis-estimations (Heumos et al. 2023 (https://doi.org/10.1038/s41576-023-00586-w), Nguyen et al. 2023 (https://doi.org/10.1038/s41467-023-37126-3)). In general, the pseudo-bulk approach used is suitable, but the correct treatment of drop-outs in the scRNA-seq analysis is essential.

      We thank the reviewer for highlighting recent concerns in the field. A study benchmarking association testing methods for perturb-seq data found that among existing methods, Seurat’s FindMarkers function performed the best (T. Barry et al. 2024).

      In the revised Methods, we now specify the formula used to calculate fold change and clarify that the estimates are derived from the Wilcoxon test implemented in Seurat’s FindMarkers function. We also employed pseudo-bulk grouping to mitigate single-cell noise and dropout effects.

      (3) Two different cell lines are used to construct dose-response curves, where a CRISPRi line allows gene down-regulation and the CRISPRa line allows gene upregulation. Although both lines are derived from the same parental line (K562) the expression analysis of Tet2, which is absent in the CRISPRi line, but expressed in the CRISPRa line (Figure S3A) suggests substantial clonal differences between the two lines. Similarly, the PCA in S4A suggests strong batch effects between the two lines. These might confound this analysis.

      We agree that baseline differences between CRISPRi and CRISPRa lines could introduce confounding effects if not appropriately controlled for. We emphasize that all comparisons are made as fold changes relative to non-targeting control (NTC) cells within each line, thereby controlling for batch- and clone-specific baseline expression. See Figures S4A and S4B.

      (4) The study uses pseudo-bulk analysis to estimate the relationship between TF dose and target gene expression. This requires a system that allows quantitative changes in TF expression. The data provided does not convincingly show that this condition is met, which however is an essential prerequisite for the presented conclusions. Specifically, the data shown in Figure S3A shows that upon stronger knock-down, a subpopulation of cells appears, where the targeted TF is not detected anymore (drop-outs). Also Figure 3B (top) suggests that the knock-down is either subtle (similar to NTCs) or strong, but intermediate knock-down (log2-FC of 0.5-1) does not occur. Although the authors argue that this is a technical effect of the scRNA-seq protocol, it is also possible that this represents a binary behavior of the CRISPRi system. Previous work has shown that CRISPRi systems with the KRAB domain largely result in binary repression and not in gradual down-regulation as suggested in this study (Bintu et al. 2016 (https://doi.org/10.1126/science.aab2956), Noviello et al. 2023 (https://doi.org/10.1038/s41467-023-38909-4)).

      Figure S3A shows normalized expression values, not fold changes. A pseudobulk approach reduces single-cell noise and dropout effects. To test whether dropout events reflect true binary repression or technical effects, we compared trans-effects across cells with zero versus low-but-detectable target gene expression (Figure S3B). These effects were highly concordant, supporting the interpretation that dropout is largely technical in origin. We agree that KRAB-based repression can exhibit binary behavior in some contexts, but our data suggest that cells with intermediate repression exist and are biologically meaningful. In ongoing unpublished work, we pursue further analysis of these data at the single cell level, and show that for nearly all guides the dosage effects are indeed gradual rather than driven by binary effects across cells.

      (5) One of the major conclusions of the study is that non-linear behavior is common. This is not surprising for gene up-regulation, since gene expression will reach a plateau at some point, but it is surprising to be observed for many genes upon TF down-regulation. Specifically, here the target gene responds to a small reduction of TF dose but shows the same response to a stronger knock-down. It would be essential to show that this observation does not arise from the technical concerns described in the previous point and it would require independent experimental validations.

      This phenomenon, where relatively small changes in cis gene dosage produce trans effects that can exceed the magnitude of the cis gene perturbations, is not unique to our study. It also makes biological sense, since transcription factors are known to be highly dosage sensitive and generally show a smaller range of variation than many of the genes they regulate. Empirically, these effects have been observed in previous CRISPR perturbation screens conducted in K562 cells, including those by Morris et al. (2023), Gasperini et al. (2019), and Replogle et al. (2022), to name a few studies whose data our lab has examined directly.

      (6) One of the conclusions of the study is that guide tiling is superior to other methods such as sgRNA mismatches. However, the comparison is unfair, since different numbers of guides are used in the different approaches. Relatedly, the authors point out that tiling sometimes surpassed the effects of TSS-targeting sgRNAs, however, this was the least fair comparison (2 TSS vs 10 tiling guides) and additionally depends on the accurate annotation of TSS in the relevant cell line.

      We do not draw this conclusion simply from observing the range achieved but from a more holistic observation. We would like to clarify that the number of sgRNAs used in each approach is proportional to the number of base pairs that can be targeted in each region: while the TSS-targeting strategy is typically constrained to a small window of a few dozen base pairs, tiling covers multiple kilobases upstream and downstream, resulting in more guides by design rather than by experimental bias. Guides with mismatches did not perform well for gradual upregulation.

      We would also like to point out that the observation that the strongest effects can arise from regions outside the annotated TSS is not unique to our study and has been demonstrated in prior work (referenced in the text).

      To address this concern, we have revised the text to clarify that we do not consider guide tiling to be inherently superior to other approaches such as sgRNA mismatches. Rather, we now describe tiling as a practical and straightforward strategy to obtain a wide range of gene dosage effects without requiring prior knowledge beyond the approximate location of the TSS. We believe this rephrasing more accurately reflects the intent and scope of our comparison.

      (7) Did the authors achieve their aims? Do the results support the conclusions?: Some of the most important conclusions are not well supported because they rely on accurately determining the quantitative responses of trans genes, which suffers from the previously mentioned concerns.

      We appreciate the reviewer’s concern, but we would have wished for a more detailed characterization of which conclusions are not supported, given that we believe our approach actually accounts for the major concerns raised above. We believe that the observation of non-linear effects is a robust conclusion that is also consistent with known biology, with this paper introducing new ways to analyze this phenomenon.

      (8) Discussion of the likely impact of the work on the field, and the utility of the methods and data to the community:

      Together with other recent publications, this work emphasizes the need to study transcription factor function with quantitative perturbations. Missing documentation of the computational code repository reduces the utility of the methods and data significantly.

      Documentation is included as inline comments within the R code files to guide users through the analysis workflow.

      Reviewer #1 (Recommendations for the authors):

      In Figure 3C (and similar plots of dosage response curves throughout the manuscript), we initially misinterpreted the plots because we assumed that the zero log fold change on the horizontal axis was in the middle of the plot. This gives the incorrect interpretation that the trans genes are insensitive to loss of GFI1B in Figure 3C, for instance. We think it may be helpful to add a line to mark the zero log fold change point, as was done in Figure 3A.

      We thank the reviewer for this helpful suggestion. To improve clarity, we have added a vertical line marking the zero log fold change point in Figure 3C and all similar dosage-response plots. We agree this makes the plots easier to interpret at a glance.

      Similarly, for heatmaps in the style of Figure 3B, it may be nice to have a column for the non-targeting controls, which should be a white column between the perturbations that increase versus decrease GFI1B.

      We appreciate the suggestion. However, because all perturbation effects are computed relative to the non-targeting control (NTC) cells, explicitly including a separate column for NTC in the heatmap would add limited interpretive value and could unnecessarily clutter the figure. For clarity, we have emphasized in the figure legend that the fold changes are relative to the NTC baseline.

      We found it challenging to assess the degree of uncertainty in the estimation of log fold changes throughout the paper. For example, the authors state the following on line 190: "We observed substantial differences in the effects of the same guide on the CRISPRi and CRISPRa backgrounds, with no significant correlation between cis gene fold-changes." This claim was challenging to assess because there are no horizontal or vertical error bars on any of the points in Figure 2A. If the log fold change estimates are very noisy, the data could be consistent with noisy observations of a correlated underlying process. Similarly, to our understanding, the dosage response curves are fit assuming that the cis log fold changes are fixed. If there is excessive noise in the estimation of these log fold changes, it may bias the estimated curves. It may be helpful to give an idea of the amount of estimation error in the cis log fold changes.

      We agree that assessing the uncertainty in log fold change estimates is important for interpreting both the lack of correlation between CRISPRi and CRISPRa effects (Figure 2A) and the robustness of the dosage-response modeling.

      In response, we have now updated Figure 2A to include both vertical and horizontal error bars, representing the standard errors of the log2 fold-change estimates for each guide in the CRISPRi and CRISPRa conditions. These error estimates were computed based on the differential expression analysis performed using the FindMarkers function in Seurat, which models gene expression differences between perturbed and control cells. We also now clarify this in the figure legend and methods.
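
      As background to the concern about noisy estimates, the classical attenuation relation shows how independent estimation error alone can mask a correlated underlying process. This is a textbook result given under stated assumptions, not an analysis from the manuscript. The symbols are introduced here for illustration: $$x, y$$ are the true CRISPRi and CRISPRa log fold changes, $$e_x, e_y$$ are independent estimation errors, and $$r_x, r_y$$ are the resulting reliabilities.

      $$\rho_{\text{obs}} = \rho_{\text{true}} \sqrt{r_x \, r_y}, \qquad r_x = \frac{\sigma_x^2}{\sigma_x^2 + \sigma_{e_x}^2}, \qquad r_y = \frac{\sigma_y^2}{\sigma_y^2 + \sigma_{e_y}^2}$$

      When the standard errors are small relative to the spread of the estimates, $$r_x$$ and $$r_y$$ are close to 1 and the observed correlation is close to the true one.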

      The authors mention hierarchical clustering on line 313, which identified six clusters. Although a dendrogram is provided, these clusters are not displayed in Figure 4A. We recommend displaying these clusters alongside the dendrogram.

      We have added colored bars indicating the clusters to improve the clarity. Thank you for the suggestion.

      In Figures 4B and 4C, it was not immediately clear what some of the gene annotations meant. For example, neither the text nor the figure legend discusses what "WBCs", "Platelets", "RBCs", or "Reticulocytes" mean. It would be helpful to include this somewhere other than only the methods to make the figure more clear.

      To improve clarity, we have updated the figure legends for Figures 4B and 4C to explicitly define these abbreviations.

      We struggled to interpret Figure 4E. Although the authors focus on the association of MYB with pHaplo, we would have appreciated some general discussion about the pattern of associations seen in the figure and what the authors expected to observe.

      We have changed the paragraph to add more exposition and clarification:

      “The link between selective constraint and response properties is most apparent in the MYB trans network. Specifically, the probability of haploinsufficiency (pHaplo) shows a significant negative correlation with the dynamic range of transcriptional responses (Figure 4G): genes under stronger constraint (higher pHaplo) display smaller dynamic ranges, indicating that dosage-sensitive genes are more tightly buffered against changes in MYB levels. This pattern was not reproduced in the other trans networks (Figure 4E)”.

      Line 71: potentially incorrect use of "rending" and incorrect sentence grammar.

      Fixed

      Line 123: "co-expression correlation across co-expression clusters" - authors may not have intended to use "co-expression" twice.

      Original sentence was correct.

      Line 246: "correlations" is used twice in "correlations gene-specific correlations."

      Fixed.

      Reviewer #2 (Recommendations for the authors):

      (1) To show that the approach indeed allows gradual down-regulation it would be important to quantify the knock-down strength with a single-cell readout for a subset of sgRNAs individually (e.g. flowFISH/protein staining flow cytometry).

      We agree that single-cell validation of knockdown strength using orthogonal approaches such as flowFISH or protein staining would provide additional support. However, such experiments fall outside the scope of the current study and are not feasible at this stage. We note that the observed transcriptomic changes and dosage responses across multiple perturbations are consistent with effective and graded modulation of gene expression.

      (2) Similarly, an independent validation of the observed dose-response relationships, e.g. with individual sgRNAs, can be helpful to support the conclusions about non-linear responses.

      Fig. S4C includes replication of trans-effects for a handful of guides used both in this study and in Morris et al. While further orthogonal validation of dose-response relationships would be valuable, such extensive additional work is not currently feasible within the scope of this study. Nonetheless, the high degree of replication in Fig. S4C as well as consistency of patterns observed across multiple sgRNAs and target genes provides strong support for the conclusions drawn from our high-throughput screen.

      (3) The calculation of the log2 fold changes should be documented more precisely. To perform a pseudo-bulk analysis, the raw UMI counts should be summed up in each group (NTC, individual targeting sgRNAs), including zero counts, then the data should be normalized and the fold change should be calculated. The DESeq package for example would be useful here.

      We have updated the methods in the manuscript to provide more exposition of how the logFC was calculated:

      “In our differential expression (DE) analysis, we used Seurat’s FindMarkers() function, which computes the log fold change as the difference between the average normalized gene expression in each group on the natural log scale:

      Logfc = log_e(mean(expression in group 1)) - log_e(mean(expression in group 2))

      This is calculated in pseudobulk where cells with the same sgRNA are grouped together and the mean expression is compared to the mean expression of cells harbouring NTC guides. To calculate per-gene differential expression p-value between the two cell groups (cells with sgRNA vs cells with NTC), Wilcoxon Rank-Sum test was used”.
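
      The formula quoted above can be sketched in a few lines. This is a minimal illustration of the stated arithmetic only, not Seurat's implementation; the function names and the expression values are hypothetical.

```python
import math
from statistics import mean


def log_fold_change(group1, group2):
    """Natural-log fold change: ln(mean(group1)) - ln(mean(group2))."""
    m1, m2 = mean(group1), mean(group2)
    if m1 <= 0 or m2 <= 0:
        raise ValueError("both group means must be positive to take a logarithm")
    return math.log(m1) - math.log(m2)


def to_log2(ln_fc):
    """Convert a natural-log fold change to the log2 scale."""
    return ln_fc / math.log(2)


# Hypothetical normalized expression values for one gene.
sgrna_cells = [0.5, 1.0, 0.0, 0.5]  # cells carrying one targeting sgRNA
ntc_cells = [2.0, 1.0, 3.0, 2.0]    # cells carrying non-targeting guides

ln_fc = log_fold_change(sgrna_cells, ntc_cells)
log2_fc = to_log2(ln_fc)
```

      Cells with zero expression still contribute to the group mean, so dropouts lower the mean rather than being excluded. A group whose mean is zero has no defined log fold change in this sketch.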

      (4) A more careful characterization of the cell lines used would be helpful. First, it would be useful to include the quality controls performed when the clonal lines were selected, in the manuscript. Moreover, a transcriptome analysis in comparison to the parental cell line could be performed to show that the cell lines are comparable. In addition, it could be helpful to perform the analysis of the samples separately to see how many of the response behaviors would still be observed.

      Details of the quality control steps used during the selection of the CRISPRa clonal line are already included in the Methods section, and Fig. S4A shows the transcriptome comparison of CRISPRi and CRISPRa lines also for non-targeting guides. Regarding the transcriptomic comparison with the parental cell line, we agree that such an analysis would be informative; however, this would require additional experiments that are not feasible within the scope of the current study. Finally, while analyzing the samples separately could provide further insight into response heterogeneity, we focused on identifying robust patterns across perturbations that are reproducible in our pooled screening framework. We believe these aggregate analyses capture the major response behaviors and support the conclusions drawn.

      (5) In general we were surprised to see such strong responses in some of the trans genes, in some cases exceeding the fold changes of the cis gene perturbation more than 2x, even at the relatively modest cis gene perturbations (Figures S5-S8). How can this be explained?

      This phenomenon—where trans gene responses can exceed the magnitude of cis gene perturbations—is not unique to our study. Similar effects have been observed in previous CRISPR perturbation screens conducted in K562 cells, including those by Morris et al. (2023), Gasperini et al. (2019), and Replogle et al. (2022).

      Several factors may contribute to this pattern. One possibility is that certain trans genes are highly sensitive to transcription factor dosage, and therefore exhibit amplified expression changes in response to relatively modest upstream perturbations. Transcription factors are known to be highly dosage sensitive and generally show a smaller range of variation than many other genes (that are regulated by TFs). Mechanistically, this may involve non-linear signal propagation through regulatory networks, in which intermediate regulators or feedback loops amplify the downstream transcriptional response. While our dataset cannot fully disentangle these indirect effects, the consistency of this observation across multiple studies suggests it is a common feature of transcriptional regulation in K562 cells.

      (6) In the analysis shown in Figure S3B, the correlation between cells with zero count and >0 counts for the cis gene is calculated. For comparison, this analysis should also show the correlation between the cells with similar cis-gene expression and between truly different populations (e.g. NTC vs strong sgRNA).

      The intent of Figure S3B was not to compare biologically distinct populations or perform differential expression analyses—which we have already conducted and reported elsewhere in the manuscript—but rather to assess whether fold change estimates could be biased by differences in the baseline expression of the target gene across individual cells. Specifically, we sought to determine whether cells with zero versus non-zero expression (as can result from dropouts or binary on/off repression from the KRAB-based CRISPRi system) exhibit systematic differences that could distort fold change estimation. As such, the comparisons suggested by the reviewer do not directly relate to the goal of the analysis which Figure S3B was intended to show.

      (7) It is unclear why the correlation between different lanes is assessed as quality control metrics in Figure S1C. This does not substitute for replicates.

      The intent of Figure S1C was not to serve as a general quality control metric, but rather to illustrate that the targeted transcript capture approach yielded consistent and specific signal across lanes. We acknowledge that this may have been unclear and have revised the relevant sentence in the text to avoid misinterpretation.

      “We used the protein hashes and the dCas9 cDNA (indicating the presence or absence of the KRAB domain) to demultiplex and determine the cell line—CRISPRi or CRISPRa. Cells containing a single sgRNA were identified using a Gaussian mixture model (see Methods). Standard quality control procedures were applied to the scRNA-seq data (see Methods). To confirm that the targeted transcript capture approach worked as intended, we assessed concordance across capture lanes (Figure S1C)”.

      (8) Figures and legends often miss important information. Figure 3B and S5-S8: what do the transparent bars represent? Figure S1A: color bar label missing. Figure S4D: what are the lines?, Figure S9A: what is the red line? In Figure S8 some of the fitted curves do not overlap with the data points, e.g. PKM. Fig. 2C: why are there more than 96 guide RNAs (see y-axis)?

      We have addressed each point as follows:

      Figure 3B: The figure legend has been updated to clarify the meaning of the transparent bars.

      Figures S5–S8: There are no transparent bars in these figures; we confirmed this in the source plots.

      Figure S1A: The color bar label is already described in the figure legend, but we have reformulated the caption text to make this clearer.

      Figure S4D: The dashed line represents a linear regression between the x and y variables. The figure caption has been updated accordingly.

      Figure S9A: We clarified that the red line shows the median ∆AIC across all genes and conditions.

      Figure S8: We agree that some fitted curves (e.g., PKM) do not closely follow the data points. This reflects high noise in these specific measurements; as noted in the text, TET2 is not expected to exert strong trans effects in this context.

      Figure 2C: Thank you for catching this. The y-axis numbers were incorrect because the figure displays the proportion of guides (summing to 100%), not raw counts. We have corrected the y-axis label and updated the numbers in the figure to resolve this inconsistency.

      (9) The code is deposited on Github, but documentation is missing.

      Documentation is included as inline comments within the R code files to guide users through the analysis workflow.

      (10) The methods miss a list of sgRNA target sequences.

      We thank the reviewer for this observation. A complete table containing all processed data, including the sequences of the sgRNAs used in this study, is available at the following GEO link:

      https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE257547&format=file&file=GSE257547%5Fd2n%5Fprocessed%5Fdata%2Etxt%2Egz

      (11) In some parts, the language could be more specific and/or the readability improved, for example:

      Line 88: "quantitative landscape".

      Changed to “quantitative patterns”.

      Lines 88-91: long sentence hard to read.

      This complex sentence was broken up into two simpler ones:

      “We uncovered quantitative patterns of how gradual changes in transcription dosage lead to linear and non-linear responses in downstream genes. Many downstream genes are associated with rare and complex diseases, with potential effects on cellular phenotypes”.

      Line 110: "tiling sgRNAs +/- 1000 bp from the TSS", could maybe be specified by adding that the average distance was around 100 or 110 bps?

      Lines 244-246: hard to understand.

      We struggle to see the issue here and are not sure how it can be reworded.

      Lines 339-342: hard to understand.

      These sentences have been reworded to provide more clarity.

      (12) A number of typos, and errors are found in the manuscript:

      Line 71: "SOX2" -> "SOX9".

      FIXED

      Line 73: "rending" -> maybe "raising" or "posing"?

      FIXED

      Line 157: "biassed".

      FIXED

      Line 245: "exhibited correlations gene-specific correlations with".

      FIXED

      Multiple instances, e.g. 261: "transgene" -> "trans gene".

      FIXED

      Line 332: "not reproduced with among the other".

      FIXED

      Figure S11: betweenness.

      This is the correct spelling

      There are more typos that we didn't list here.

      We went through the manuscript and corrected all the spelling errors and typos.

    1. eLife Assessment

      This study presents a valuable tool named TSvelo, a computational framework for RNA velocity inference that models transcriptional regulation and gene-specific splicing. The evidence supporting the claims of the authors is solid, although elaboration of the computational benchmark and datasets would have strengthened the study. The work will be of interest to computational scientists working in the field of RNA biology.

    2. Reviewer #1 (Public review):

      Summary:

      In the paper, the authors propose a new RNA velocity method, TSvelo, which predicts the transcription rate linearly based on the expression of RNA levels of transcription factors. This framework extends the authors' recent work, TFvelo, by including unspliced reads and designing a coherent neuralODE framework. Improved performance was demonstrated in six diverse datasets.

      Strengths:

      Overall, this method introduces innovative solutions to link cell differentiation and gene regulation, with a balance between model complexity (neuralODE) and interpretability (raw gene space).

      Weaknesses:

      While it seems to provide convincing results, there are multiple technical concerns for the authors to clarify and double-check.

      (1) The authors should clarify and discuss the TF-target map: here, the TF-target genes map is predefined by the TF binding's ChIP-seq data. This annotation is largely incomplete and mostly compiled from a set of bulk tissues. Therefore, for a certain population, the TF-target relation may change. This requires clarification and discussion, possibly exploring how to address this in the model. In addition, a regulon database could be added, e.g., DoRothEA?

      (2) The authors should clarify how example genes are selected. This is particularly unclear in Figure 2d.

      (3) The authors should clarify confidence in the statement in lines 179-180, that ANXA4 should initially decrease. This is particularly concerning, as TSvelo didn't capture the cell cycle transitions well during the initial part.

      (4) A supporting reference should be added for the statement in line 260 that "neuron migrations are inside-out manner". No reference is currently given, and this statement is critical for the model assessment.

      (5) The comparison to scMultiomics data is particularly interesting, as MultiVelo uses ATAC data to predict the transcription rate. It would be very insightful to add a direct comparison of the estimated transcription rate between using ATAC and directly using TFs' RNA expressions.

      (6) In Figure 6g, it should be clarified how the lineage was determined. Did the authors use the LARRY barcodes, predicted cell fate, or any other methods? Here, the best way is probably using the LARRY barcodes for individual clones.

    3. Reviewer #2 (Public review):

      Summary:

      Li et al. propose TSvelo, a computational framework for RNA velocity inference that models transcriptional regulation and gene-specific splicing using a neural ODE approach. The method is intended to improve trajectory reconstruction and capture dynamic gene expression changes in scRNA-seq data. However, the manuscript in its current form falls short in several critical areas, including rigorous validation, quantitative benchmarking, clarity of definitions, proper use of prior knowledge, and interpretive caution. Many of the authors' claims are not fully supported by the evidence.

      Major comments:

      (1) Modeling comments

      (a) Lines 512-513: How does the U-to-S delay validate the accuracy of pseudotime? Using only a single gene as an example is not sufficient for "validation."

      (b) Lines 512-518: The authors propose a strategy for selecting the initial state, but do not benchmark how accurate this selection procedure is, nor do they provide sufficient rationale. While some genes may indeed exhibit U-to-S delay during lineage differentiation, why does the highest U-to-S delay score indicate the correct initiation states? Please provide mathematical justification and demonstrate accuracy beyond using a single gene example. Maybe a simulation with ground truth could help here, too.

      (c) Equation (8): The formulation looks to be incorrect. If $$W \in \mathbb{R}^{G\times G}$$ and $$W' - \Gamma' \in \mathbb{R}^{K\times K}$$, how can they be aligned within the same row? Please clarify.

      (d) The use of prior knowledge graphs from ENCODE or ChEA to constrain regulation raises concerns. Much of the regulatory information in these databases comes from cell lines. How can such cell-line-based regulation be reliably applied to primary tissues, as is done throughout the manuscript? Additional experiments are needed to test the robustness of TSvelo with respect to prior knowledge.

      (e) Lines 579-580: How is the grid search performed? More methodological details are required. If an existing method was used, please provide a citation.

      (2) Application on pancreatic endocrine datasets

      (a) Lines 140-141: What is the definition of the final pseudotime-fitted time t or velocity pseudotime?

      (b) Lines 143-144: The use of the velocity consistency metric to benchmark methods in multi-lineage datasets is incorrect. In multi-lineage differentiation systems, cells (e.g., those in fate priming stages) may inherently show inconsistency in their velocity. Thus, it is difficult to distinguish inconsistency caused by estimation error from that arising from biological signals. Velocity consistency metrics are only appropriate in systems with unidirectional trajectories (e.g., cell cycling). The abnormally high consistency values here raise concerns about whether the estimated velocities meaningfully capture lineage differences.

      (c) The improvement of TSvelo over other methods in terms of cross-boundary direction correctness looks marginal; a statistical test would help to assess its significance.

      (d) Lines 177-178: Based on the figure, TSvelo does not appear to clearly distinguish cell types. A quantitative metric, such as Adjusted Rand Index (ARI), should be provided.

      (e) Lines 179-183: The claim that traditional methods cannot capture dynamics in the unspliced-spliced phase portrait is vague. What specific aspect is not captured: the fitted values or something else? Evidence is lacking. Please provide a detailed explanation and quantitative metrics to support this claim.
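
      For reference, the Adjusted Rand Index mentioned in comment (d) can be computed from the contingency table of two labelings. The sketch below uses the standard pair-counting definition; the function name and the example labels are hypothetical and do not come from the manuscript.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting Adjusted Rand Index between two labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # zero or one cell: the labelings agree trivially

    # Pairs of cells placed together in both labelings, in the first, in the second.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings form a single cluster
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical annotations versus clusters derived from a velocity embedding.
annotated = ["alpha", "alpha", "beta", "beta"]
ari_same = adjusted_rand_index(annotated, [1, 1, 0, 0])    # same partition, renamed
ari_split = adjusted_rand_index(annotated, [0, 1, 0, 1])   # partition cuts across types
```

      The index is 1 for identical partitions regardless of label names, is near 0 for chance-level agreement, and can be negative when agreement is worse than chance.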

      (3) Application to gastrulation erythroid datasets

      (a) Lines 191-194: The observation that velocity genes are enriched for erythropoiesis-related pathways is trivial, since the analysis is restricted to highly variable genes (HVGs) from an erythropoiesis dataset. This enrichment is expected and therefore not informative.

      (b) Lines 227-228: It remains unclear how TSvelo "accurately captures the dynamics." What is the definition of dynamics in this context? Figure 3g shows unspliced/spliced vs. fitted time plots and phase portraits, but without a quantitative definition or measure, the claim of superiority cannot be supported. Visualization of a single gene is insufficient; a systematic and quantitative analysis is needed.

      (4) Application to the mouse brain and other datasets

      (a) Lines 280-281: The authors cannot claim that velocity streams are smoother in TSvelo than in MultiVelo based solely on 2D visualization. Similarly, claiming that one model predicts the correct differentiation trajectory from a 2D projection is over-interpretation, as has been discussed in prior literature (see PMID: 37885016).

      (b) Lines 304-306: Beyond transcriptional signal estimation, how is regulation inferred solely from scRNA-seq data validated, especially compared with scATAC-seq data? Are there cases where transcriptome-based regulatory inference is supported by epigenomic evidence, thereby demonstrating TSvelo's GRN inference accuracy?

      (c) The claim that TSvelo can model multi-lineage datasets hinges on its use of PAGA for lineage segmentation, followed by independent modeling of dynamics within each subset. However, the procedure for merging results across subsets remains unclear.

    4. Reviewer #3 (Public review):

      Despite the abundance of RNA velocity tools, there are still major limitations, and there is strong skepticism about the results these methods lead to. In this paper, the authors try to address some limitations of current RNA velocity approaches by proposing a unified framework to jointly infer transcriptional and splicing dynamics. The method is then benchmarked on 6 real datasets against the most popular RNA velocity tools.

      While the approach has the potential to be of interest for the field, and may present improvements compared to existing approaches, there are some major limitations that should be addressed, particularly concerning the benchmark (see major comment 1).

      Major comments:

      (1) My main criticism concerns the benchmarking: real data lack a ground truth, and are absolutely not ideal for comparing methods, because one can only speculate what results appear to be more plausible. A solid and extensive simulation study, which covers various scenarios and possibly distinct data-generating models, is needed for comparing approaches. The authors should check, for example, the simulation studies in the BayVel approach (Section 4, BayVel: A Bayesian Framework for RNA Velocity Estimation in Single-Cell Transcriptomics). Clearly, all methods should be included in the simulation.

      (2) Related to the above: since a ground truth is missing, the real data analyses need to be interpreted with caution. I recommend avoiding strong statements, such as "successfully captures the correct gene dynamics", or "accurately infer", in favour of milder statements supported by the data, such as "... aligns with the biological processes described" (as in page 12), or "results are compatible with current biological knowledge", etc...

      (3) Many methods perform RNA velocity analyses. While there is a brief description, I think it'd be useful to have a schematic summary (e.g., via a Table) of the main conceptual, mathematical, and computational characteristics of each approach.

      (4) Related to the above: I struggled to identify the main conceptual novelty of TSvelo, compared to existing approaches. I recommend explaining this aspect more extensively.

      (5) A computational benchmark is missing; I'd appreciate seeing the runtime and memory cost of all methods in a couple of datasets.

      (6) I think BayVel (mentioned above) should be added to the list of competing methods (both in the text and in the benchmarks). The package can be found here: https://github.com/elenasabbioni/BayVel_pkgJulia .
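
      The runtime and memory comparison requested in comment (5) could be collected with a small harness along the following lines. This is only a sketch: `benchmark` and `toy_method` are hypothetical names, and `toy_method` stands in for a real velocity method. Note that `tracemalloc` tracks allocations made through Python's memory allocator, so memory used inside native numerical libraries may be undercounted.

```python
import time
import tracemalloc


def benchmark(func, *args, **kwargs):
    """Run func once and return (result, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak_bytes


def toy_method(n):
    """Hypothetical stand-in for fitting a velocity model on n cells."""
    squares = [i * i for i in range(n)]
    return sum(squares)


result, seconds, peak = benchmark(toy_method, 10_000)
print(f"runtime: {seconds:.4f} s, peak traced memory: {peak / 1024:.1f} KiB")
```

      For a fair comparison, each method would be run on the same datasets and hardware, ideally several times, with the spread reported alongside the mean.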

    5. Author response:

      Reviewer #1:

      We appreciate the reviewer’s positive assessment of TSvelo and their helpful technical comments. In the revised manuscript, we will:

      (1) Provide a clearer discussion of TF–target annotations, their limitations, and potential integration of additional databases.

      (2) Clarify the rationale for example-gene selection (e.g., in Fig. 2d).

      (3) Re-evaluate and temper the interpretation regarding ANXA4 and early-stage cell-cycle transitions.

      (4) Add appropriate references supporting neuronal inside-out migration.

      (5) Include additional analysis comparing TF-based transcription rate estimation with ATAC-based estimates from MultiVelo.

      (6) Clarify how lineages were determined in Fig. 6g and incorporate barcode-based validation where applicable.

      (7) Correct all typographical errors noted.

      Reviewer #2:

      We appreciate the reviewer’s careful examination of modeling, benchmarking, and interpretation. To address these concerns, we will:

      (1) Expand the methodological justification for initial-state selection, add simulations with ground truth, and evaluate U-to-S delay more broadly across genes.

      (2) Clarify matrix formulations and ensure consistency in notation (e.g., Eq. 8).

      (3) Assess robustness to prior-knowledge graphs and evaluate alternatives beyond ENCODE/ChEA.

      (4) Add methodological details on parameter search.

      (5) Improve benchmarking on pancreatic endocrine datasets by including clear definitions of velocity pseudotime, ARI for cell-type separation, quantitative evaluation of phase-portrait fits, and appropriate interpretation of consistency metrics for multi-lineage systems.

      (6) Reframe claims about “accurate” or “correct” predictions where evidence is qualitative and strengthen quantitative support where possible.

      (7) Clarify lineage segmentation and merging when applying PAGA-guided multi-lineage modeling.

      Reviewer #3:

      We thank the reviewer for highlighting the need for more rigorous benchmarking and conceptual clarity. In response, we will:

      (1) Conduct an expanded simulation study incorporating different data-generating models.

      (2) Revise all strong claims to more cautious, evidence-based language.

      (3) Add a concise table summarizing conceptual and computational differences among RNA-velocity frameworks.

      (4) More clearly articulate the conceptual novelty of TSvelo relative to existing approaches.

      (5) Include runtime and memory benchmarks across representative datasets.

      (6) Explore additional methods in conceptual comparisons and benchmarking analyses.

      We appreciate the reviewers’ thoughtful input and agree that the suggested analyses and clarifications will significantly improve the rigor and clarity of the manuscript. We will incorporate all recommended revisions in the resubmission and provide a full, detailed, point-by-point response at that time.

    1. eLife Assessment

      This valuable study investigates the role of P-bodies in yeast proliferation and mRNA regulation within the phyllosphere, proposing that P-body assembly contributes to methanol metabolism and stress adaptation. The findings are of interest to researchers studying post-transcriptional gene regulation and microbial ecology in plants. However, the evidence is incomplete, as most experiments were performed under artificial conditions, relied on limited genetic validation, and were supported primarily by qualitative or low-resolution imaging.

    2. Reviewer #1 (Public review):

      Summary:

      Stemming from the previous research on the adaptation of methylotrophic microbes in the phyllosphere environment, this paper tested a novel hypothesis on the molecular and cellular mechanisms by which yeast uses biomolecular condensates as unique niches for the regulation of methanol-induced mRNAs. While a few in vivo experiments were conducted in the phyllosphere, more assays were carried out on plates to mimic various stress conditions, diminishing the reliability of the conclusions in supporting the main hypothesis.

      Strengths:

      This study addressed an interesting and important biological question. Some of the experiments were conducted methodically and carefully. The visualization of both the biomolecular condensates and the mRNAs was helpful in addressing the questions. The results are expected to be useful in paving the way for future studies to directly test the main hypothesis. The results of this study could also have general implications for the adaptation of the huge population of microbes in the enormous space of the phyllosphere on Earth.

      Weaknesses:

      The results were often over- and misinterpreted. Given that many hypotheses were tested indirectly on plates, the correlative results could only be used to carefully suggest the likelihood of the hypotheses. For example, a single edc3 mutant was used to represent a P-body-defective strain, although it is well known that EDC3 is a critical component in mRNA decapping; hence, the mutant should display a pleiotropic phenotype, rather than a mere reduced P-body phenotype. Using a similar reductionist approach, the study went on to employ a series of plate assays to argue that the conditions were mimicking the phyllosphere, which could be misleading under these circumstances. Furthermore, the low percentage of colocalization between P-bodies and mimRNA granules and the similar results from negative control mRNAs do not convincingly support the idea that mimRNAs are sequestered between two biomolecular condensates and that P-bodies could serve as regulatory hubs. Given that the abundance of mimRNA granules was positively correlated with the transcript abundance of mimRNAs, and P-body abundance did not change much under methanol induction, the results could not support an active mimRNA sequestration mechanism from mimRNA granules to P-bodies with a proportional increase of the overlap between the two condensates. More direct experiments conducted in the phyllosphere using multiple P-body-defective yeast strains should strengthen the manuscript, assuming all the results turned out to be supportive.

    3. Reviewer #2 (Public review):

      Summary:

      This article aims to elucidate the potential roles of P-bodies in yeast adaptation to complex environmental conditions, such as the plant leaf phyllosphere. The authors demonstrated that yeast mutants defective in one of the P-body-localized proteins failed to grow in the Arabidopsis thaliana phyllosphere. They conducted detailed imaging analyses, focusing particularly on the co-localization of P-bodies and mRNAs (DAS1) related to the methanol metabolism pathway under various environmental conditions. The study newly revealed that these mRNAs form dot-like structures that occasionally co-localize with a P-body marker. Furthermore, the authors showed that the number of P-body-labeled dots increases under stress conditions, such as H₂O₂ treatment, and that mRNA dots are more frequently localized to P-body-like structures. Based on these detailed observations, the authors hypothesize that P-bodies function to protect mRNAs from degradation, particularly under stress conditions.

      Strengths:

      I think the authors' attempt to elucidate the potential roles of P-bodies in yeast under stress conditions is novel, and the imaging data are overall very nice.

      Weaknesses:

      I believe the authors could make additional efforts to more clearly demonstrate that P-bodies are indeed required for yeast proliferation in the phyllosphere, as described below, since this represents the most novel aspect of the study.

    4. Reviewer #3 (Public review):

      Summary:

      The authors use fluorescent microscopy and fluorescent markers to investigate the requirement of P-bodies during growth on methanol, a common substrate available on plant leaves, by using a yeast edc3 mutant defective in P-body formation. Growth on methanol upregulates the transcription of methanol metabolic genes, which accumulate in granular structures, as observed by microscopy. Co-localization of P-bodies and granules was quantified and described as dynamically enhanced during oxidative stress. Ultimately, the authors suggest a model where methanol induces the accumulation of methanol-induced mRNAs in cytosolic granules, which dynamically interact with P-bodies, especially during oxidative stress, to protect the mRNAs from degradation. However, this model is not strongly supported by the provided data, as the quantification of the co-localization between different markers (of organelles and between P-body and granules) is not well presented or described in the text.

      Considering that there is only a small EDC3-dependent overlap between P-bodies and mimRNA granules, the claim that P-bodies regulate mimRNAs is not fully justified. Rather, EDC3 could also be involved in mimRNA granule formation, independent of P-bodies.

      Strengths:

      (1) The authors could show convincingly that P-bodies (using a P-body-deficient edc3-KO strain) are important for colonizing the plant phyllosphere and for the regulation of methanol-induced mRNAs (mimRNA).

      (2) The visualization of mimRNA granules and P-bodies using fluorescent markers is interesting and was validated by alternative methods, such as FISH staining.

      (3) The dynamic formation of mimRNA granules and P-bodies was demonstrated during growth on leaves and in artificial medium during oxidative stress. The mimRNA granules showed a similar dynamic as the abundances of several mimRNAs and their corresponding proteins.

      (4) A role of EDC3 in the formation of mimRNA granules was demonstrated. However, the link between P-bodies and mimRNA granules was not clearly shown.

      Weaknesses:

      (1) The study largely relies on fluorescent microscopy and co-localization measurements. However, the subcellular resolution is not very high; it is unclear how dot-like structures were measured and, importantly, how co-localization was quantified.

      (2) The text does not clarify to what degree P-bodies and mimRNA granules are different structures. Based on the images, the size of P-bodies and granules seems to be vastly different, making it unclear whether these structures are fused or separate, even if their markers are reported to overlap.

      (3) The evidence that mimRNA granules contain ribosome-free and ribosome-associated RNA is only based on inhibitors and microscopy, without providing further evidence measuring granule content by isolation and sequencing approaches.

      (4) Similarly, the co-localization with other organelle markers is not supported by quantitative data.
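
      To make the request for quantified co-localization in (1) and (4) concrete, the following is a minimal sketch of two common pixel-based measures, Pearson's correlation and a simple thresholded variant of Manders' coefficients. It is not the authors' pipeline: the function names, the threshold arguments, and the assumption that each channel is a background-subtracted 2D intensity array are all illustrative.

```python
import numpy as np


def pearson_coloc(ch1, ch2):
    """Pearson correlation between two image channels, computed over all pixels."""
    a = np.asarray(ch1, dtype=float).ravel()
    b = np.asarray(ch2, dtype=float).ravel()
    return float(np.corrcoef(a, b)[0, 1])


def manders_coefficients(ch1, ch2, thr1=0.0, thr2=0.0):
    """Thresholded Manders-style coefficients (hypothetical helper).

    m1: fraction of channel-1 intensity located in pixels where channel 2
        exceeds thr2.
    m2: fraction of channel-2 intensity located in pixels where channel 1
        exceeds thr1.
    """
    a = np.asarray(ch1, dtype=float)
    b = np.asarray(ch2, dtype=float)
    m1 = a[b > thr2].sum() / a.sum()
    m2 = b[a > thr1].sum() / b.sum()
    return float(m1), float(m2)


# Toy example: channel 2 signal lies entirely inside channel 1 signal,
# but only half of the channel 1 signal overlaps channel 2.
granules = np.array([[0.0, 10.0], [10.0, 0.0]])
pbodies = np.array([[0.0, 10.0], [0.0, 0.0]])
m1, m2 = manders_coefficients(granules, pbodies)
```

      Reporting both coefficients per cell, together with the thresholding rule, would let readers judge whether a small overlap reflects genuine association or chance coincidence of abundant puncta.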

    1. eLife Assessment

      This fundamental study presents experimental evidence on how geomagnetic and visual cues are integrated in a nocturnally migrating insect. The evidence supporting the conclusions is compelling. The work will be of broad interest to researchers studying animal migration and navigation.

    2. Reviewer #1 (Public review):

      Summary

      The manuscript by Ma et al. provides robust and novel evidence that the noctuid moth Spodoptera frugiperda (Fall Armyworm) possesses a complex compass mechanism for seasonal migration that integrates visual horizon cues with Earth's magnetic field (likely its horizontal component). This is an important and timely study: apart from the Bogong moth, no other nocturnal Lepidoptera has yet been shown to rely on such a dual-compass system. The research therefore expands our understanding of magnetic orientation in insects with both theoretical (evolution and sensory biology) and applied (agricultural pest management, a new model of magnetoreception) significance.

      The study uses state-of-the-art methods and presents convincing behavioural evidence for a multimodal compass. It also establishes the Fall Armyworm as a tractable new insect model for exploring the sensory mechanisms of magnetoreception, given the experimental challenges of working with migratory birds. Overall, the experiments are well-designed, the analyses are appropriate, and the conclusions are generally well supported by the data.

      Strengths

      (1) Novelty and significance: First strong demonstration of a magnetic-visual compass in a globally relevant migratory moth species, extending previous findings from the Bogong moth and opening new research avenues in comparative magnetoreception.

      (2) Methodological robustness: Use of validated and sophisticated behavioural paradigms and magnetic manipulations consistent with best practices in the field. The use of 5-minute bins to study the dynamic nature of the magnetic compass which is anchored to a visual cue but updated with a latency of several minutes, is an important finding and a new methodological aspect in insect orientation studies.

      (3) Clarity of experimental logic: The cue-conflict and visual cue manipulations are conceptually sound and capable of addressing clear mechanistic questions.

      (4) Ecological and applied relevance: Results have implications for understanding migration in an invasive agricultural pest with an expanding global range.

      (5) Potential model system: Provides a new, experimentally accessible species for dissecting the sensory and neural bases of magnetic orientation.

      Weaknesses

      While the study is strong overall, several recommendations should be addressed to improve clarity, contextualisation, and reproducibility:

      (1) Structure and presentation of results

      Requires reordering the visual-cue experiments to move from simpler (no cues) to more complex (cue-conflict) conditions, improving narrative logic and accessibility for non-specialists.

      (2) Ecological interpretation

      (a) The authors should discuss how their highly simplified, static cue setup translates to natural migratory conditions where landmarks are dynamic, transient or absent.

      (b) Further consideration is required regarding how the compass might function when landmarks shift position, are obscured, or are replaced by celestial cues. Also, more consolidated (one section) and concrete suggestions for future experiments are needed, with transient, multiple, or more naturalistic visual cues to address this.

      (3) Methodological details and reproducibility

      (a) It would be better to move critical information (e.g., electromagnetic noise measurements) from the supplementary material into the main Methods.

      (b) Specifying luminance levels and spectral composition at the moth's eye is required for all visual treatments.

      (c) Details are needed on the sex ratio/reproductive status of tested moths, and a map of the experimental site and migratory routes (spring vs. fall) should be included.

      (d) Expanding on activity-level analyses is required, replacing "fatigue" with "reduced flight activity," and clarifying if such analyses were performed.

      (4) Figures and data presentation

      (a) The font sizes on circular plots should be increased; compass labels (magnetic North), sample sizes, and p-values should be included.

      (b) More clarity is required on what "no visual cue" conditions entail, and schematics or photos should be provided.

      (c) The figure legends should be adjusted for readability and consistency (e.g., replace "magnetic South" with magnetic North, and for box plots better to use asterisks for significance, report confidence intervals).

      (5) Conceptual framing and discussion

      (a) Generalisations across species should be toned down, given the small number of systems tested by overlapping author groups.

      (b) It requires highlighting that, unlike some vertebrates, moths require both magnetic and visual cues for orientation.

      (c) It should be emphasised that this study addresses direction finding rather than full navigation.

      (d) Future Directions should be integrated and consolidated into one coherent subsection proposing realistic next steps (e.g., more complex visual environments, temporal adaptation to cue-field relationships).

      (e) The limitations should be better discussed, due to the artificiality of the visual cue earlier in the Discussion.

      (6) Technical and open-science points

      • Appropriate circular statistics should be used instead of t-tests for angular data shown in the supplementary material.

      • Details should be provided on light intensities, power supplies, and improvements to the apparatus.

      • The derivation of individual r-values should be clarified.

      • Share R code openly (e.g., GitHub).

      • Some highly relevant recent citations are missing and should be added, and some less relevant ones removed.
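
      The points above on circular statistics and on the derivation of r-values can be illustrated with a short sketch. It computes the mean direction and mean resultant length r for a set of headings and applies a Rayleigh test using a commonly used closed-form approximation for the p-value. The function names are hypothetical, headings are assumed to be in degrees, and this is not necessarily how the authors derived their values.

```python
import numpy as np


def mean_resultant(angles_deg):
    """Return (mean direction in degrees, mean resultant length r)."""
    theta = np.radians(np.asarray(angles_deg, dtype=float))
    c = np.cos(theta).mean()
    s = np.sin(theta).mean()
    r = float(np.hypot(c, s))
    mean_dir = float(np.degrees(np.arctan2(s, c)) % 360.0)
    return mean_dir, r


def rayleigh_test(angles_deg):
    """Rayleigh test for non-uniformity; returns (z statistic, approximate p)."""
    n = len(angles_deg)
    _, r = mean_resultant(angles_deg)
    big_r = n * r
    z = n * r ** 2
    # Closed-form approximation; assumed adequate here, check against a
    # dedicated circular-statistics package before reporting.
    p = np.exp(np.sqrt(1 + 4 * n + 4 * (n ** 2 - big_r ** 2)) - (1 + 2 * n))
    return float(z), float(min(p, 1.0))


# Headings clustered around North: the arithmetic mean of 350, 0 and 10
# is 120 degrees, whereas the circular mean is (numerically) 0 degrees.
clustered = [350.0, 0.0, 10.0] * 5
uniform = list(np.arange(0.0, 360.0, 30.0))
```

      The toy headings show why linear tests such as t-tests are unsuitable for angular data: values that wrap around 0°/360° are treated as far apart even though they point in nearly the same direction.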

    3. Reviewer #2 (Public review):

      Summary:

      This work provides experimental evidence on how geomagnetic and visual cues are integrated, and shows that visual cues are indispensable for magnetic orientation in the nocturnal fall armyworm.

      Strengths:

      Although it has been demonstrated previously that the Australian Bogong moth could integrate global stellar cues with the geomagnetic field for long-distance navigation, the study presented in this manuscript is still fundamentally important to the field of magnetoreception and sensory biology. It clearly shows that the integration of geomagnetic and visual cues may represent a conserved navigational mechanism broadly employed across migratory insects. I find the research very important, and the results are presented very well.

      Weaknesses:

      The authors developed an indoor experimental system to study the influence of magnetic fields and visual cues on insect orientation, which is certainly a valuable approach for this field. However, the ecological relevance of the visual cue may be limited or unclear based on the current version. The visual cues were provided "by a black isosceles triangle (10 cm high, 10 cm base) made from black wallpaper and fixed to the horizon at the bottom of the arena". It is difficult to conceive how such a stimulus (intended to represent a landmark like a mountain) could provide directional information for LONG-DISTANCE navigation in nocturnal fall armyworms, particularly given that these insects would have no prior memory of this specific landmark. It might be a good idea to provide a more detailed explanation of this question.

    1. eLife Assessment

      This important work introduces a family of interpretable Gaussian process models that allows us to learn and model sequence-function relationships in biomolecules. These models are applied to three recent empirical fitness landscapes, providing convincing evidence of their predictive power. The findings should be of interest to the community working on the sequence-function relationship, on epistasis, and on fitness landscapes.

    2. Reviewer #1 (Public review):

      Summary:

      Zhou and colleagues introduce a series of generalized Gaussian process models for genotype-phenotype mapping. The goal was to develop models that were more powerful than standard linear models, while retaining explanatory power as opposed to neural network approaches. The novelty stems from choices of prior distributions (and I suppose fitted posteriors) that model epistasis based on some form of site/allele-specific modifier effect and genotype distance. The authors then apply their models to three empirical datasets, the GB1 antibody-binding dataset, the human 5' splice site dataset, and a yeast meiotic cross dataset, and find substantially improved variance explained while retaining strong explanatory power when compared to linear models.

      Strengths:

      The main strength of the manuscript lies in the development of the modeling approaches, as well as the evidence from the empirical dataset that the variance explained is improved.

      Weaknesses:

      The main weakness of the paper is that none of the models were tested on an in silico dataset where the ground truth is known. Therefore, it is unclear if their model actually retains any explanatory power.

      Impact:

      Genotype-phenotype mapping is a central point of genetics. However, the function is complex and unknown. Simple linear models can uncover some functional link between genes and their effects, but do so through severe oversimplification of the system. On the other hand, neural networks can, in principle, model the function perfectly, but they do so without easy interpretation. Gaussian process regression is another approach that improves on linear regression, allowing better fitting of the data while allowing interpretation of the underlying alleles and their effects. This approach, now computable with state-of-the-art algorithms, will advance the field of genotype-to-phenotype associations.

    3. Reviewer #2 (Public review):

      This paper builds on prior work by some of the same authors on how to model fitness landscapes in the presence of epistasis. They have previously shown how simply writing general expansions of fitness in terms of one-body plus two-body plus three-body, etc., terms often fails to generalize to good predictions. They have also previously introduced a Gaussian process regression approach regarding how much epistasis there should be of each order.

      This paper contains several main advances:

      (1) They implement a more efficient form of the Gaussian process model fitting that uses GPUs and related algorithmic advances to enable better fitting of these models to datasets for larger sequences.

      (2) They provide a software package implementing the above.

      (3) They generalize the models to allow the extent of epistasis associated with changes in sequence to depend on specific sites, alleles, and mutations.

      (4) They show modest improvements in prediction and substantial improvements in interpretability with the more generalized models above.

      Overall, while this paper is quite technical, my assessment is that it represents a substantial conceptual and algorithmic advance for the above reasons, and I would recommend only modest revisions. The paper seems well-written and clear, given the inherent complexity of this topic.

    4. Reviewer #3 (Public review):

      Summary:

      The authors propose three types of Gaussian process kernels that extend and generalize standard kernels used for sequence-function prediction tasks, giving rise to the connectedness, Jenga, and general product models. The associated hyperparameters are interpretable and represent epistatic effects of varying complexity. The proposed models significantly outperform the simpler baselines, including the additive model, pairwise interaction model, and Gaussian process with a geometric kernel, in terms of R^2.

      Strengths:

      (1) The demonstrated performance boost and improved scaling with increasing training data are compelling.

      (2) The hyperparameter selection step using the marginal likelihood, as implemented by the authors, seems to yield a reasonable hyperparameter combination that lends itself to biologically plausible interpretations.

      (3) The proposed kernels generalize existing kernels in domain-interpretable ways, and can correspond to cases that would not be "physical" in the original models (e.g., $\mu_p>1$ in the original connectedness model that allows modeling of anticorrelated phenotypes).

      Weaknesses:

      (1) While enabling uncertainty quantification is a key advantage of Gaussian processes, the authors do not present metrics specific to the predicted uncertainties; all metrics seem to concern the mean predictions only. It would be helpful to evaluate coverage metrics and maybe include an application of the uncertainties, such as in active learning or Bayesian optimization.

      (2) The more complex models, like the general product model, place a heavier burden on the hyperparameter selection step. Explicitly discussing the optimization routine used here would be helpful to potential users of the method and code.

    1. eLife Assessment

      This important study describes a novel Bayesian psychophysical approach that efficiently measures how well humans can discriminate between colors across the entire isoluminant plane. The evidence was considered compelling, as it included successful model validation against hold-out data and published datasets. This approach could prove to be of use to color vision scientists, as well as to those who use computational psychophysics and attempt to model perceptual stimulus fields with smooth variations over coordinate spaces.

    2. Reviewer #1 (Public review):

      Summary:

      This paper presents an ambitious and technically impressive attempt to map how well humans can discriminate between colours across the entire isoluminant plane. The authors introduce a novel Wishart Process Psychophysical Model (WPPM) - a Bayesian method that estimates how visual noise varies across colour space. Using an adaptive sampling procedure, they then obtain a dense set of discrimination thresholds from relatively few trials, producing a smooth, continuous map of perceptual sensitivity. They validate their procedure by comparing actual and predicted thresholds at an independent set of sample points. The work is a valuable contribution to computational psychophysics and offers a promising framework for modelling other perceptual stimulus fields more generally.

      Strengths:

      The approach is elegant and well-described (I learned a lot!), and the data are of high quality. The writing throughout is clear, and the figures are clean (elegant in fact) and do a good job of explaining how the analysis was performed. The whole paper is tremendously thorough, and the technical appendices and attention to detail are impressive (for example, a huge amount of data about calibration, variability of the stim system over time, etc). This should be a touchstone for other papers that use calibrated colour stimuli.

      Weaknesses:

      Overall, the paper works as a general validation of the WPPM approach. Importantly, the authors validate the model for the particular stimuli that they use by testing model predictions against novel sample locations that were not part of the fitting procedure (Figure 2). The agreement is pretty good, and there is no overall bias (perhaps local bias?), but they do note a statistically-significant deviation in the shape of the threshold ellipses. The data also deviate significantly from historical measurements, and I think the paper would be considerably stronger with additional analyses to test the generality of its conclusions and to make clearer how they connect with classical colour vision research. In particular, three points could use some extra work:

      (1) Smoothness prior. The WPPM assumes that perceptual noise changes smoothly across colour space, but the degree of smoothness (the eta parameter) must affect the results. I did not see an analysis of its effects - it seems to be fixed at 0.5 (line 650). The authors claim that because the confidence intervals of the MOCS and the model thresholds overlap (line 223), the smoothing is not a problem, but this might just be because the thresholds are noisy. A systematic analysis varying this parameter (or at least testing a few other values), and reporting both predictive accuracy and anisotropy magnitude, would clarify whether the model's smoothness assumption is permitting or suppressing genuine structure in the data. Is the gamma parameter also similarly important? In particular, does changing the underlying smoothness constraint alter the systematic deviation between the model and the MOCS thresholds? The authors have thought about this (of course! - line 224), but also note a discrepancy (line 238). I also wonder if it would be possible to do some analysis on the posterior, which might also show if there are some regions of color space where this matters more than others? The reason for doing this is, in part, motivated by the third point below - it's not clear how well the fits here agree with historical data.

      (2) Comparison with simpler models. It would help to see whether the full WPPM is genuinely required. Clearly, the data (both here and from historical papers) require some sort of anisotropy in the fitting - the sensitivities decrease as the stimuli move away from the adaptation point. But it's >not< clear how much the fits benefit from the full parameterisation used here. Perhaps fits for a small hierarchy of simpler models - starting with isotropic Gaussian noise (as a sort of 'null baseline') and progressing to a few low-dimensional variants - would reveal how much predictive power is gained by adding spatially varying anisotropy. This would demonstrate that the model's complexity is justified by the data.

      (3) Quantitative comparison to historical data. The paper currently compares its results to MacAdam, Krauskopf & Karl, and Danilova & Mollon only by visual inspection. It is hard to extract and scale actual data from historical papers, but from the quality of the plotting here, it looks like the authors have achieved this, and so quantitative comparisons are possible. The MacAdam data comparisons are pretty interesting - in particular, the orientations of the long axes of the threshold ellipses do not really seem to line up between the two datasets - and I thought that the orientation of those ellipses was a critical feature of the MacAdam data. Quantitative comparisons (perhaps overall correlations, which should be immune to scaling issues, axis-ratio, orientation, or RMS differences) would give concrete measures of the quality of the model. I know the authors spend a lot of time comparing to the CIE data, and this is great.... But re-expressing the fitted thresholds in CIE or DKL coordinates, and comparing them directly with classical datasets, would make the paper's claims of "agreement" much more convincing.
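
      The scale-free comparisons suggested in (3) could be sketched as follows. The example assumes, purely for illustration, that each threshold ellipse is summarised by a 2x2 symmetric positive-definite matrix whose eigenvectors give the axis directions and whose eigenvalues are proportional to the squared semi-axis lengths. The function names are hypothetical, and neither the parameterisation nor the code is taken from the paper.

```python
import numpy as np


def ellipse_summary(cov):
    """Return (axis ratio, major-axis orientation in degrees within [0, 180))."""
    vals, vecs = np.linalg.eigh(np.asarray(cov, dtype=float))  # ascending order
    major = vecs[:, -1]  # eigenvector of the largest eigenvalue
    axis_ratio = float(np.sqrt(vals[-1] / vals[0]))
    orientation = float(np.degrees(np.arctan2(major[1], major[0])) % 180.0)
    return axis_ratio, orientation


def orientation_difference(a_deg, b_deg):
    """Smallest angle between two axes; orientations are equivalent modulo 180."""
    d = abs(a_deg - b_deg) % 180.0
    return min(d, 180.0 - d)


# Toy ellipse: semi-axes in a 2:1 ratio, major axis rotated by 30 degrees.
angle = np.radians(30.0)
rot = np.array([[np.cos(angle), -np.sin(angle)],
                [np.sin(angle), np.cos(angle)]])
cov_example = rot @ np.diag([4.0, 1.0]) @ rot.T
ratio, orient = ellipse_summary(cov_example)
```

      Axis ratio and orientation are both unaffected by an overall scaling of the thresholds, so they can be compared across datasets even when absolute threshold sizes differ because of stimulus or task differences.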

      Overall, this is a creative and technically sophisticated paper that will be of broad interest to vision scientists. It is probably already a definitive methods paper showing how we can sample sensitivity accurately across colour space (and other visual stimulus spaces). But I think that until the comparison with historical datasets is made clear (and, for example, how the optimal smoothness parameters are estimated), it has slightly less to tell us about human colour vision. This might actually be fine - perhaps we just need the methods?

      Related to this, I'd also note that the authors chose a very non-standard stimulus to perform these measurements with (a rendered 3D 'Greebley' blob). This does have the advantage of some sort of ecological validity. But it has the significant >disadvantage< that it is unlike all the other (much simpler) stimuli that have been used in the past - and this is likely to be one of the reasons why the current (fitted) data do not seem to sit in very good agreement with historical measurements.

    3. Reviewer #2 (Public review):

      Summary:

      Hong et al. present a new method that uses a Wishart process to dramatically increase the efficiency of measuring visual sensitivity as a function of stimulus parameters for stimuli that vary in a multidimensional space. Importantly, they have validated their model against their own hold-out data and against 3 published datasets, as well as against colour spaces aimed at 'perceptual uniformity' by equating JNDs. Their model achieves high predictive success and could be usefully applied in colour vision science and psychophysics more generally, and to tackle analogous problems in neuroscience featuring smooth variation over coordinate spaces.

      Strengths:

      (1) This research makes a substantial contribution by providing a new method to very significantly increase the efficiency with which inferences about visual sensitivity can be drawn, so much so that it will open up new research avenues that were previously not feasible. Secondly, the methods are well thought out and unusually robust. The authors made a lot of effort to validate their model, but also to put their results in the context of existing results on colour discrimination, transforming their results to present them in the same colour spaces as used by previous authors to allow direct comparisons. Hold-out validation is a great way to test the model, and this has been done for an unusually large number of observers (by the standards of colour discrimination research). Thirdly, they make their code and materials freely available with the intention of supporting progress and innovation. These tools are likely to be widely used in vision science, and could of course be used to address analogous problems for other sensory modalities and beyond.

      Weaknesses:

      It would be nice to better understand what constraints the choice of basis functions puts on the space of possible solutions. More generally, could there be particular features of colour discrimination (e.g., rapid changes near the white point) that the model captures less well? The substantial individual differences evident in Figure S20 (comparison with Krauskopf and Gegenfurtner, 1992) are interesting in this context. Some observers show radial biases for the discrimination ellipses away from the white point, some show biases along the negative diagonal (with major axes oriented parallel to the blue-yellow axis), and others show a mixture of the two biases. Are these genuine individual differences, or could the model be performing less accurately in this desaturated region of colour space?

    4. Reviewer #3 (Public review):

      Summary:

      This study presents a powerful and rigorous approach for characterizing stimulus discriminability throughout a sensory manifold, and is applied to the specific context of predicting color discrimination thresholds across the chromatic plane.

      Strengths:

      Color discrimination has played a fundamental role in studies of human color vision and for color applications, but as the authors note, it remains poorly characterized. The study leverages the assumption that thresholds should vary smoothly and systematically within the space, and validates this with their own tests and comparisons with previous studies.

      Weaknesses:

      The paper assumes that threshold variations are due to changes in the level of intrinsic noise at different stimulus levels. However, it's not clear to me why they could not also be explained by nonlinearities in the responses, with fixed noise. Indeed, most accounts of contrast coding (which the study is at least in part measuring because the presentation kept the adapt point close to the gray background chromaticity, and thus measured increment thresholds), assume a nonlinear contrast response function, which can at least as easily explain why the thresholds were higher for colors farther from the gray point. It would be very helpful if a section could be added that explains why noise differences rather than signal differences are assumed and how these could be distinguished. If they cannot, then it would be better to allow for both and refer to the variation in terms of S/N rather than N alone.

      Related to this point, the authors note that the thresholds should depend on a number of additional factors, including the spatial and temporal properties and the state of adaptation. However, many of these again seem to be more likely to affect the signal than the noise.

      An advantage of the approach is that it makes no assumptions about the underlying mechanisms. However, the choice to sample only within the equiluminant plane is itself a mechanistic assumption, and these could potentially be leveraged for deciding how to sample to improve the characterization and efficiency. For example, given what we know about early color coding, would it be more (or less) efficient to select samples based on a DKL space, etc?

    1. eLife Assessment

      This valuable study demonstrates that self-motion strongly affects neural responses to visual stimuli, comparing humans moving through a virtual environment to passive viewing. However, evidence that the modulation is due to prediction is incomplete as it stands, since participants may come to expect visual freezes over the course of the experiment. This study bridges human and rodent studies on the role of prediction in sensory processing, and is therefore expected to be of interest to a large community of neuroscientists.

    2. Reviewer #1 (Public review):

      In this paper, the authors wished to determine human visuomotor mismatch responses in EEG in a VR setting. Participants were required to walk around a virtual corridor, where a mismatch was created by halting the display for 0.5s. This occurred every 10-15 seconds. They observe an occipital mismatch signal at 180 ms. They determine the specificity of this signal to visuomotor mismatch by subsequently playing back the same recording passively. They also show qualitatively that the mismatch response is larger than one generated in a standard auditory oddball paradigm. They conclude that humans therefore exhibit visuomotor mismatch responses like mice, and that this may provide an especially powerful paradigm for studying prediction error more generally.

      Asking about the role of visuomotor prediction in sensory processing is of fundamental importance to understanding perception and action control, but I wasn't entirely sure what to conclude from the present paradigm or findings. Visuomotor prediction did not appear to have been functionally isolated. I hope the comments below are helpful.

      (1) First, isolating visuomotor prediction by contrasting against a condition where the same video stream is played back subsequently does not seem to isolate visuomotor prediction. This condition always comes second, and therefore, predictability (rather than specifically visuomotor predictability) differs. Participants can learn to expect these screen freezes every 10-15 s, even precisely where they are in the session, and this will reduce the prediction error across time. Therefore, the smaller response in the passive condition may be partly explained by such learning. It's impossible to fully remove this confound, because the authors currently play back the visual specifics from the visuomotor condition, but given that the visuomotor correspondences are otherwise pretty stable, they could have an additional control condition where someone else's visual trace is played back instead of their own, and order counterbalanced. Learning that the freezes occur every 10-15 s, or even precisely where they occur, therefore, could not explain condition differences. At a minimum, it would be nice to see the traces for the first and second half of each session to see the extent to which the mismatch response gets smaller. This won't control for learning about the specific separations of the freezes, but it's a step up from the current information.

      (2) Second, the authors admirably modified their visual-only condition to remove nausea from 6 degrees of freedom (DOF) of movement (3D position, pitch, yaw, and roll). However, although it's far from ideal to have nauseous participants, it would appear from the figures that these modifications may have changed the responses (despite some pairwise lack of significance with small N). Specifically, the trace in S3 (6DOF) and 2E look similar - i.e., comparing the visuomotor condition to the visual condition that matches. Mismatch at 4/5 microvolts in both. Do these significantly differ from each other?

      (3) It generally seems that if the authors wish to suggest that this paradigm can be used to study prediction error responses, they need to have controlled for the actions performed and the visual events. This logic is outlined in Press, Thomas, and Yon (2023), Neurosci Biobehav Rev, and Press, Kok, and Yon (2020) Trends Cogn Sci ('learning to perceive and perceiving to learn'). For example, always requiring Ps to walk and always concurrently playing similar visual events, but modifying the extent to which the visual events can be anticipated based on action. Otherwise, it seems more accurately described as a paradigm to study the influence of action on perception, which will be generated by a number of intertwined underlying mechanisms.
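
      The split-half check requested in point (1) above can be sketched as follows. This is only an illustration under stated assumptions: epochs are assumed to be stored as a trials-by-samples array in chronological order, and the function names are hypothetical rather than taken from the authors' analysis code.

```python
import numpy as np

def split_half_erps(epochs):
    """Average the first and second half of the trials separately.

    epochs: array of shape (n_trials, n_samples), in chronological order
    (an assumed layout). With an odd trial count the second half gets
    the extra trial. Returns (first_half_erp, second_half_erp).
    """
    epochs = np.asarray(epochs, dtype=float)
    half = epochs.shape[0] // 2
    return epochs[:half].mean(axis=0), epochs[half:].mean(axis=0)

def peak_amplitude(erp, times, window):
    """Largest absolute ERP value inside a (start, end) window in seconds."""
    mask = (times >= window[0]) & (times <= window[1])
    return np.max(np.abs(erp[mask]))
```

      Comparing the two peak amplitudes per participant would show how much the mismatch response shrinks over the session, although, as noted in point (1), it would not control for learning the spacing of the freezes.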

      More minor points:

      (1) I was also wondering whether the authors may consider the findings in frontal electrodes more closely. Within the statistical tests of the frontal electrodes against 0, as displayed in Figure 3c, the non-significance of the effect at Fp2 seems attributable to the small included sample size of just 13 participants for this electrode, as listed in Table S1, in combination with a single outlier skewing the result. The small sample size stands out especially in comparison to the sample size at occipital electrodes, which is double and therefore enjoys far more statistical power. It looks like the selected time window is not perfectly aligned for determining a frontal effect, and also the distribution in 3B looks like responses are absent in more central electrodes but present in occipital and frontal ones. I realise the focus of analysis is on visual processing, but there are likely to be researchers who find the frontal effect just as interesting.

      (2) It is claimed throughout the manuscript that the 'strongest predictor (of sensory input) - by consistency of coupling - is self-generated movement'. This claim is going to be hard to validate, and I wonder whether it might be received better by the community to be framed as an especially strong predictor rather than necessarily the strongest. If I hear an ambulance siren, this is an especially strong predictor of subsequent visual events. If I see a traffic light turn red, then yellow, I can be pretty certain what will happen next. Etc.

      (3) The checkerboard inversion response at 48 ms is incredibly rapid. Can the authors comment more on what may drive this exceptionally fast response? It was my understanding that responses in this time window can only be isolated with human EEG by presenting spatially polarized events (cf. C1, e.g., Alilovic, Timmermans, Reteig, van Gaal, Slagter, 2019, Cerebral Cortex).

    3. Reviewer #2 (Public review):

      Summary:

      This study investigates whether visuomotor mismatch responses can be detected in humans. By adapting paradigms from rodent studies, the authors report EEG evidence of mismatch responses during visuomotor conditions and compare them to visual-only stimulation and mismatch responses in other modalities.

      Strengths:

      (1) The authors use a creative experimental design to elicit visuomotor mismatch responses in humans.

      (2) The study provides an initial dataset and analytical framework that could support future research on human visuomotor prediction errors.

      Weaknesses:

      (1) Methodological issues (e.g., volume conduction, channel selection, lack of control for eye movements) make it difficult to confidently attribute the observed mismatch responses to activity in visual cortical regions.

      (2) A very large portion of the data was excluded due to motion artefacts, raising concerns about statistical power and representativeness. The criteria for trial inclusion and the number of accepted trials per participant appear arbitrary and not justified with reference to EEG reliability standards.

      (3) The comparison across sensory modalities (e.g., auditory vs. visual mismatch responses) is conceptually interesting, but due to the choice of analyzing auditory mismatch responses over occipital channels, it has limited interpretability.

      The authors successfully demonstrate that visuomotor mismatch paradigms can, in principle, be applied in human EEG. However, due to the issues outlined above, the current findings are relatively preliminary. If validated with improved methodology, this approach could significantly advance our understanding of predictive processing in the human visual system and provide a translational bridge between rodent and human work.

    4. Reviewer #3 (Public review):

      Summary:

      Solyga, Zelechowski, and Keller present a concise report of an innovative study demonstrating clear visuomotor mismatch responses in ambulating humans, using a mobile EEG setup and virtual reality. Human subjects walked around a virtual corridor while EEGs were recorded. Occasionally, motion and visual flow were uncoupled, and this evoked a mismatch response that was strongest in occipitally placed electrodes and had a considerable signal-to-noise ratio. It was robust across participants and could not be explained by the visual stimulus alone.

      Strengths:

      This is an important extension of their prior work in mice, and represents an elegant translation of those previous findings to humans, where future work can inform theories of e.g., psychiatric diseases that are believed to involve disordered predictive processing. For the most part, the authors are appropriately circumspect in their interpretations and discussions of the implications. I found the discussion of the polarity differences they found in light of separate positive and negative prediction errors, intriguing.

      Weaknesses:

      The primary weaknesses rest in how the results are sold and interpreted.

      Most notably, the interpretation of the results of the comparison of visuomotor mismatches to the passive auditory oddball induced mismatch responses is inappropriate, given suboptimal electrode choices, unclear matching of trial numbers, and other factors. To clarify, regarding the auditory oddball portion in Figure 5, the data quality is a concern for the auditory ERPs, and the choice of occipital electrodes is a likely culprit. Typically, auditory evoked responses are maximal at Cz or FCz, although these contacts don't seem to be available with this setup. In general, caution is warranted in comparing ERP peaks between two different sensory modalities - especially if attention is directed elsewhere (to a silent movie) during one recording and not during the other. The authors discuss this as a purely "qualitative" comparison in the text, which is appreciated, and do acknowledge the limitations within the results section, but the figure title and, importantly, the abstract set a different tone. At least, for comparisons between auditory mismatch and visuomotor mismatch, trial numbers need to be equated, as ERP magnitude can be augmented by noise (which reduces with increased numbers of trials in the average). And more generally, the size of the mismatch event at the scalp does not scale one-to-one with the size at the level of the neural tissue. One can imagine a number of variables that impact scalp-level magnitudes and are orthogonal to actual cortex-level activation: the size, spread, and polarity variance of the activated source (which all would diminish amplitude at the scalp due to polyphasic summation/cancelation); the variance of phase to a stimulus across trials (cross-trial phase locking) vs magnitude of underlying power (the former, in theory, relates to bottom-up activity and the latter can reflect feedback, which has more variability in time across trials); and the distance of the scalp electrode from the activated tissue (which, for the auditory system, would be larger (FCz to superior temporal gyrus) than for the visual system (O1 to V1/2)). None of this precludes the inclusion of the auditory mismatch, which is a strength of the study, but interpretations about this supporting a supremacy of sensory-motor mismatch - regardless of validity - are not warranted. I would recommend changing the way this is presented in the abstract.
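
      To illustrate why trial numbers matter for peak comparisons, here is a minimal sketch on simulated data. It assumes epochs stored as trials-by-samples arrays; the helper names and the noise-only simulation are hypothetical and are not drawn from the manuscript.

```python
import numpy as np

def equate_trial_counts(epochs_a, epochs_b, rng):
    """Randomly subsample the larger condition so both have the same trial count."""
    n = min(len(epochs_a), len(epochs_b))
    keep_a = rng.choice(len(epochs_a), size=n, replace=False)
    keep_b = rng.choice(len(epochs_b), size=n, replace=False)
    return epochs_a[keep_a], epochs_b[keep_b]

def peak_of_average(epochs):
    """Peak absolute amplitude of the trial-averaged waveform."""
    return np.max(np.abs(epochs.mean(axis=0)))

rng = np.random.default_rng(0)
# Pure noise with no evoked response: any "peak" is produced by residual noise.
few = rng.normal(size=(20, 200))
many = rng.normal(size=(400, 200))
print(peak_of_average(few), peak_of_average(many))

# Subsample so both conditions contribute the same number of trials.
many_eq, few_eq = equate_trial_counts(many, few, rng)
print(many_eq.shape, few_eq.shape)
```

      In this toy case the average over fewer trials shows the larger spurious peak, which is the reason for equating trial counts before comparing peak magnitudes across paradigms.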

      Otherwise, the data are of adequate quality to derive most of their conclusions.

      The authors claim that the mismatch responses emanate from within the occipital cortex, but I would require denser scalp coverage or a demonstration of consistent impedances across electrodes and across subjects to make conclusions about the underlying cortical sources (especially given the latencies of their peaks). In EEG, the distribution of voltage on the scalp is, of course, related to but not directly reflective of the distribution of the underlying sources. The authors are mostly careful in their discussion of this, but I would strongly recommend changing the word choice of "in occipital cortex" to "over occipital cortex" or even "posteriorly distributed". Even with very dense electrode coverage and co-registration to MRIs for the generation of forward models that constrain solutions, source localization of EEG signals is very challenging and not a simple problem. Given the convoluted and interior nature of human V1, the ability to reliably detect early evoked responses (which show the mismatch in mouse models) at the scalp in ERP peaks is challenging - especially if one is collapsing ERPs across subjects. And - given the latency of the mismatch responses, I'd imagine that many distributed cortical regions contribute to the responses seen at the scalp.

      I think that Figure 3C, but as a difference of visual mismatch vs halting flow alone (in the open loop) might be additionally informative, as it clarifies exactly where the pure "mismatch" or prediction error is represented.

      As a suggestion, the authors are encouraged to analyse time-frequency power and phase locking for these mismatch responses, as is common in much of the literature (see Roach et al 2008, Schizophrenia Bulletin). This is not to say that doing so will yield insights into oscillations per se, but converting the data to the time-frequency domain provides another perspective that has some advantages. It fosters translations to rodent models, as ERP peaks do not map well between species, but e.g., delta-theta power does (see Lee et al 2018, Neuropsychopharmacology; Javitt et al 2018, Schizophrenia research; Gallimore et al 2023, Cereb Ctx). Further, ERP peaks can be influenced by the actual neuroanatomy of an individual (especially for quantifying V1 responses). Time-frequency analyses may aid in interpreting the "early negative deflection with a peak latency of 48 ms" finding as well.
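
      For concreteness, the kind of analysis meant here can be sketched with a single-frequency complex Morlet wavelet. This is an illustrative sketch, not a recommended pipeline: the trials-by-samples layout, the wavelet width (`n_cycles`), and the function names are all assumptions, and each epoch must be longer than the wavelet.

```python
import numpy as np

def morlet_wavelet(freq, sfreq, n_cycles=5.0):
    """Complex Morlet wavelet at `freq` Hz, normalised to unit energy."""
    sigma_t = n_cycles / (2.0 * np.pi * freq)  # temporal width in seconds
    t = np.arange(-4 * sigma_t, 4 * sigma_t, 1.0 / sfreq)
    w = np.exp(2j * np.pi * freq * t) * np.exp(-t ** 2 / (2 * sigma_t ** 2))
    return w / np.sqrt(np.sum(np.abs(w) ** 2))

def power_and_itc(epochs, freq, sfreq, n_cycles=5.0):
    """Total power and inter-trial coherence (phase locking) at one frequency.

    epochs: (n_trials, n_samples); each epoch must be longer than the wavelet.
    Returns two arrays of length n_samples.
    """
    w = morlet_wavelet(freq, sfreq, n_cycles)
    analytic = np.array([np.convolve(trial, w, mode="same") for trial in epochs])
    magnitude = np.maximum(np.abs(analytic), 1e-12)  # avoid division by zero
    power = np.mean(magnitude ** 2, axis=0)          # averaged over trials
    itc = np.abs(np.mean(analytic / magnitude, axis=0))  # 0 = random phase, 1 = locked
    return power, itc
```

      Total power is sensitive to activity whose timing varies from trial to trial, whereas inter-trial coherence isolates the phase-locked part, which is the distinction drawn above between feedback-related and bottom-up contributions.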

      Finally, the sentence in the abstract that this paradigm "can trigger strong prediction error responses and consequently requires shorter recording times would simplify experiments in a clinical setting" is a nice setup to the paper, but the very fact that one third of recordings had to be removed due to movement artifact, and that hairstyle modulates the recording SNR, is reason to think that this paradigm, using the reported equipment, may have limited clinical utility in its current form. Further, auditory oddball paradigms are of great clinical utility because they do not require explicit attention and can be recorded very quickly with no behavioral involvement of a hospitalized patient. This should be discussed, although it does not detract from the overall scientific importance of the study. The authors should reconsider putting this statement in the abstract.

    1. eLife assessment

      This meta-analysis provides a fundamental synthesis of evidence demonstrating that transcranial magnetic stimulation targeting the hippocampal-cortical network reliably enhances episodic memory performance across diverse study designs. The evidence is convincing, with rigorous methodology and consistent effects observed despite modest sample sizes and some heterogeneity in stimulation approaches. The work highlights the specificity of memory improvements to hippocampal-dependent memories and identifies key methodological factors-such as individualized targeting-that influence efficacy. Overall, this study offers a timely and integrative framework that will inform both basic memory research and the design of future clinical trials for cognitive enhancement.

    2. Reviewer #1 (Public review):

      Summary:

      Goicoechea et al. conducted a timely and thorough meta-analysis on the potential for indirect hippocampal targeted transcranial magnetic stimulation (TMS) to improve episodic memory. The authors included additional factors of interest in their meta-analysis, which can be used to inform the next generation of studies using this intervention. Their analysis revealed critical factors for consideration: TMS should be applied pre-encoding, individualized spatial targeting improves efficacy, and improvement of recollection was stronger than recognition.

      Strengths:

      As mentioned previously, the meta-analysis is timely and summarizes an emerging set of studies (over the past decade since Wang et al., Science 2014). Those outside of the field may not be aware of the robustness of improvements in episodic memory from hippocampal targeted TMS. The authors were quite thorough in including additional factors that are important for the interpretation of these findings. These factors also address the differences in approach across studies. The evidence that individualized spatial targeting improves TMS efficacy is consistent with recent advances in TMS for major depressive disorder. The specificity of the cognitive improvements to recollection of episodic memory and not for other cognitive domains is consistent with hippocampal targeting. The authors also plan to post the complete dataset on an open-source repository, which enables additional analysis by other researchers.

      Weaknesses:

      The write-up is succinct and emphasizes the scientific decisions that underlie key differences in the various experimental designs. While the manuscript is written for a scientific audience, the authors are likely aware that findings like this will be of broad appeal to the field of neurology, where treatments for memory loss are desperately needed. For this reason, the authors could consider including a statement regarding an interpretation of this meta-analysis from a clinical standpoint. Statements such as 'safe and effective' imply a clinical indication, and yet the manuscript does not engage with clinical trials terminology such as blinding, parallel arm versus crossover design, and trial phase. While the authors might prefer not to engage with this terminology, it can be confusing when studies delivering intervention-like five days of consecutive TMS (e.g., Wang et al., 2014) are clustered with studies that delivered online rhythmic TMS, which tests target engagement (e.g., Hermiller et al., 2020). While the 'sessions' variable somewhat addresses the basic-science versus intervention-like approach, adding an explicit statement regarding this in the discussion might help the reader navigate the broad scope of approaches that are utilized in the meta-analysis.

    3. Reviewer #2 (Public review):

      Summary:

      In 2014, Wang et al. showed that noninvasive stimulation of a parietal site, connected functionally to the hippocampus, increased resting state connectivity throughout a canonical network associated with episodic memory. It also produced a memory boost, which correlated with the connectivity increase across subjects. Their discovery that an imaging biomarker could be used to target a network (rather than a single cortical site) in individual subjects and provide a scaling measure of target modulation should have revolutionized the noninvasive neuromodulation field. This meta-analysis by members of the same group covers memory effects from noninvasive stimulation of various nodes of the "hippocampal" network.

      Strengths:

      This is a very timely summary and meta-analysis of this very promising application of TMS. To the limited extent of my expertise in meta-analysis, the methodology seems rigorous, and the central finding, that high-frequency stimulation of nodes in the hippocampal network reproducibly improves event recall, is amply supported. This should provide impetus for larger clinical trials and further quantification of the optimal dose, duration of effect, etc.

      Weaknesses:

      My critical comments are mainly on the framing and argument:

      (1) While the introduction centers on the role of the hippocampus in episodic memory and posits hippocampal neuromodulation by TMS as causative, the true mechanism may be more complex. Clean hippocampal lesions in primates cause focal loss of spatial and place memory, and I am aware of no specific evidence that the hippocampus does more than this in humans. Moreover, there is evidence that lateral parietal TMS also reaches neighboring temporal lobe regions, which contribute to episodic memory. The hippocampus may, therefore, be a reliable deep seed for connectivity-based targeting of the episodic memory network, but might not be the true or only functional target.

      (2) The meta-analysis combines studies with confirmation of targeting and target-network engagement from fMRI and studies without independent evidence of having stimulated the putative target (e.g., Koch et al). That seems like a more important methodological distinction than merely the use of any individual targeting method. In my experience, atlas-based estimates are at least as accurate as eyeballing cortical areas in individuals. Hence, entering individual functional targeting as a factor might reveal an effect on efficacy.

      (3) The funnel plot and Egger's regression for episodic memory outcomes suggested possible bias, and the average sample size of 23 is small, contributing to the likelihood of false positive results. It would be informative, therefore, to know how many or which studies had formal power estimates and what the predicted effect sizes were.

      (4) In the Discussion, the authors might provide a comparison between the effect size for memory improvement found here with those reported for other brain-targeted interventions and behavioral strategies. It may also be worthwhile pointing out that HITS/memory is one of the very few, or perhaps the only, neuromodulatory effects on cognition that has been extensively reproduced and survived rigorous meta-analysis.

      (5) The section of the Discussion on specificity compares HITS to transcranial electrical stimulation without specifying an anatomical target or intended outcome. A better contrast might be the enormous variety of cognitive and emotional effects claimed for TMS of the dorsolateral prefrontal cortex.

      (6) With reference to why other nodes in the episodic memory network have not been tested, current flow modeling shows TMS of the medial prefrontal cortex is unlikely to be achievable without stronger stimulation of the convexity under the coil, in addition to being uncomfortable. The lateral temporal lobe has been stimulated without undue discomfort.

      (7) Finally, a critical question hanging over the clinical applicability of HITS and other neuromodulation techniques is how well they will work on a damaged substrate. Functional and/or anatomical imaging might answer this question and help screen for likely responders. The authors' opinion on this would be informative.
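
      As background to point (3) above, Egger's regression can be sketched in a few lines: each study's standardized effect (effect divided by its standard error) is regressed on its precision (one over the standard error), and an intercept far from zero suggests funnel-plot asymmetry. The sketch below uses unweighted least squares with hypothetical function and variable names; it is not the authors' implementation.

```python
import numpy as np

def egger_intercept(effects, std_errors):
    """Egger's regression: regress effect/SE on 1/SE by ordinary least squares.

    Returns (intercept, intercept_se, t_statistic); the t statistic has
    n - 2 degrees of freedom.
    """
    effects = np.asarray(effects, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    y = effects / std_errors   # standard normal deviate per study
    x = 1.0 / std_errors       # precision per study
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - 2
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    intercept_se = np.sqrt(cov[0, 0])
    return beta[0], intercept_se, beta[0] / intercept_se
```

      With an average sample size of 23 per study, the standard errors span a narrow range, so the intercept is estimated imprecisely; this is one reason formal power estimates from the included studies would be informative.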

    4. Reviewer #3 (Public review):

      Summary:

      The manuscript by Goicoechea et al. assesses the influence of hippocampal-network targeted TMS to parietal cortex on episodic memory using a meta-analytic approach. This is an important contribution to the literature, as the number of studies using this approach to modulate memory/hippocampal function has clearly increased since the initial publication by Wang et al. 2014. In general, the analysis is straightforward and the conclusions are well-supported by the results; I have mostly minor comments/concerns.

      Strengths:

      (1) A meta-analysis across published work is used to evaluate the influence of hippocampal-network-targeted TMS in parietal cortex on episodic memory. By pooling results across studies, the meta-analytic effects demonstrate an influence of TMS on memory across the diversity of many details in study design (specific tasks, stimuli, TMS protocols, study populations).

      (2) Selectivity with regard to episodic memory vs. non-episodic memory tasks is evaluated directly in the meta-analysis.

      (3) Supplemental factors were tested as predictors of TMS's influence on memory. This is helpful given the diversity of study designs in the literature. This analysis helps to shed light on which study designs, e.g., TMS protocols, etc., are most effective in memory modulation.

      Weaknesses:

      (1) My only significant concern is how studies are categorized in the 'Timing' factor (when stimulation is applied). Currently, protocols in which TMS is administered across days are categorized as 'pre-encoding' in the Timing factor. This has the potential to be misleading and may lead to inaccurate conclusions. When TMS is administered across multiple days, followed by memory encoding and retrieval (often on a subsequent day), it is not possible to attribute the influence of TMS to a specific memory phase (i.e., encoding or retrieval) per se. Thus, labeling multi-day TMS studies as 'pre-encoding' may be misleading to readers, as it may imply that the influence of TMS is due to modulation of encoding mechanisms per se, which cannot be concluded. For example, multi-day TMS protocols could be labeled as 'pre-retrieval' and be similarly accurate. This approach also pools results from TMS protocols with temporal specificity (i.e., those applied immediately during encoding and not on board during memory testing) and without temporal specificity (i.e., the case of multi-day TMS) regarding TMS timing. Given the variety of paradigms employed in the literature, and to maximize the utility/accuracy of this analysis, one suggestion is to modify the categories within the Timing factor, e.g., using labels like 'Temporally-Specific' and 'Temporally Non-specific'. The 'Temporally-Specific' category could be subdivided based on the specific memory process affected: 'encoding', 'retrieval', or 'consolidation' (if possible). I think this would improve the accuracy of the approach and help to reach more meaningful conclusions, given the variety of protocols employed in the literature.

      (2) As the scope of the meta-analysis is limited to TMS applied to parietal or superior occipital cortex, it is important to highlight this in the Introduction/Abstract. The 'HITS' terminology suggests a general approach that would not necessarily be restricted to parietal/nearby cortical sites.

      Minor:

      (1) To reduce the number of study factors tested, data reduction was performed via Lasso regression to remove factors that were not unique predictors of the influence of TMS on memory. This approach is reasonable; however, one limitation is that factors strongly correlated with others (and predict less unique variance) will be dropped. This may result in a misrepresentation, i.e., if readers interpret factors left out of this analysis as not being strongly related to the influence of TMS on memory. I do see and appreciate the paragraph in the Discussion which appropriately addresses this issue. However, it may be worth also considering an alternative analysis approach, if the authors have not already done so, which explicitly captures the correlation structure in the data (i.e., shown in Figure S2) using a tool like PCA or an appropriate factor analysis. Then, this shared covariance amongst factors can be tested as predictors of the influence of TMS - e.g., by testing whether component scores for dominant PCs are indeed predictive of the influence of TMS. This complementary approach would capture rather than obfuscate the extent to which different factors are correlated and assess their joint (rather than independent) influence on memory, potentially resulting in more descriptive conclusions. For example, TMS intensity and protocol may jointly influence memory.

      (2) Given the specific focus on TMS applied to parietal cortex to modulate hippocampal and related network function, it would be fruitful if the authors could consider adding discussion/speculation regarding whether this approach may be effectively broadened using other stimulation methods (e.g., tACS, tDCS), how it may compare to other non-invasive brain stimulation methods with depth penetration to target hippocampal function directly (transcranial temporal interference, or transcranial focused ultrasound), and/or how or whether other stimulation sites may or may not be effective.

      (3) Studies were only included in the meta-analysis if they contained objective episodic memory tests. How were studies handled that included both objective and subjective memory, or other non-episodic memory measures? For example, Yazar et al. 2014 showed no influence of TMS on objective recall, but an impairment in subjective confidence. I assume confidence was not included in the meta-analysis. Similarly, Webler et al. 2024 report results from both the mnemonic similarity task (presumably included) and a fear conditioning paradigm (presumably excluded). Please clarify in the methods how these distinctions were handled.

      (4) The analysis comparing memory to non-memory measures is important, showing the specificity of stimulation. Did the authors consider further categorizing the non-memory tasks into distinct domains (i.e., language, working memory, etc.)? If possible, this could provide finer detail regarding the selectivity of influences on memory vs. other aspects of cognition. It is likely that other aspects of cognition dependent on hippocampal function may be modulated as well, i.e., tasks with high relational/associative processing demands.

      (5) In the analysis of the Intensity factor, how were studies using Active (rather than resting) MT categorized? Only resting MT is mentioned in Table S1. This is important as the original theta-burst TMS protocol from Huang et al. 2005 determines intensity based on Active Motor Threshold.

      (6) Is there a reason why the study by Koen et al. 2018 (Cognitive Neuroscience) was not included? TMS was performed during encoding to the left AG, and objective memory was assessed, so it would seemingly meet the inclusion criterion.

      (7) It would be helpful to briefly differentiate the current meta-analysis from that performed by Yeh & Rose (How can transcranial magnetic stimulation be used to modulate episodic memory?: A systematic review and meta-analysis, 2019, Frontiers in Psychology) (other than being more current).

      (8) For transparency and to facilitate further understanding of the literature and potential data re-use, it would be great if the authors consider sharing a supplementary table or file that describes how individual studies/memory measures were categorized under the factors listed in Table S1.
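
      The complementary analysis suggested in minor point (1) above could look roughly like the following sketch: z-score the study factors, extract principal components, and regress effect sizes on the component scores. The numeric coding of factors and the function names are assumptions, and a real meta-regression would additionally weight studies by inverse variance, which this unweighted sketch omits.

```python
import numpy as np

def pca_scores(factors, n_components):
    """Project z-scored study factors onto their leading principal components.

    factors: (n_studies, n_factors) numeric matrix (an assumed coding).
    Returns (scores, explained_variance_ratio).
    """
    factors = np.asarray(factors, dtype=float)
    z = (factors - factors.mean(axis=0)) / factors.std(axis=0, ddof=1)
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = s ** 2 / np.sum(s ** 2)
    return scores, explained[:n_components]

def regress_on_scores(scores, effect_sizes):
    """Unweighted least-squares fit of effect sizes on component scores plus intercept."""
    X = np.column_stack([np.ones(len(scores)), scores])
    beta, *_ = np.linalg.lstsq(X, effect_sizes, rcond=None)
    return beta
```

      Because correlated factors load on the same component, a factor such as intensity would contribute to the prediction through its shared variance with protocol rather than being dropped.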

    1. eLife Assessment

      This useful study provides a well-constructed computational investigation of how intermittent theta-burst stimulation (iTBS) influences synaptic plasticity within the corticothalamic circuit, improving our mechanistic understanding of how stimulation parameters interact with intrinsic brain oscillations. The authors build a corticothalamic population model that generates individual alpha rhythms with a calcium-dependent metaplasticity rule, and provide solid evidence that aligning stimulation frequencies to brain-intrinsic oscillatory subharmonics enhances plasticity effects. This insight could open a route toward personalized, more effective stimulation protocols.

    2. Reviewer #1 (Public review):

      Summary:

      The authors show that the lower frequency (~5Hz) stimulation of the intermittent theta-burst stimulation (iTBS) via repetitive transcranial magnetic stimulation (rTMS) serves as a more effective stimulation paradigm than the high-frequency protocols (HF-rTMS, ~10Hz) with enhancing plasticity effects via long-term potentiation (LTP) and depression (LTD) mechanisms. They show that the 5 Hz patterned pulse structure of the iTBS is an exact subharmonic of the 10 Hz high-frequency rTMS, creating a connection between the two paradigms and acting upon the same underlying synchrony mechanism of the dominant alpha-rhythm of the corticothalamic circuit.

      First, the authors create a corticothalamic neural population model consisting of 4 populations: cortical excitatory pyramidal and inhibitory interneuron, and thalamic excitatory relay and inhibitory reticular populations. Second, the authors include a calcium-dependent plasticity model, in which calcium-related NMDAR-dependent synaptic changes are implemented using a BCM metaplasticity rule. The rTMS-induced fluctuations in intracellular calcium concentrations determine the synaptic plasticity effects.
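
      The subharmonic relationship described in this summary (5 Hz being the second subharmonic of 10 Hz) amounts to the stimulation frequency equalling the base frequency divided by an integer. A small sketch, with hypothetical function names and an assumed numerical tolerance:

```python
def subharmonic_order(base_hz, stim_hz, tol=1e-6):
    """Return n if stim_hz equals base_hz / n for a positive integer n, else None."""
    ratio = base_hz / stim_hz
    n = round(ratio)
    if n >= 1 and abs(ratio - n) <= tol:
        return n
    return None

def subharmonics(base_hz, max_order):
    """List base_hz / n for n = 1 .. max_order."""
    return [base_hz / n for n in range(1, max_order + 1)]
```

      For an individual alpha frequency other than 10 Hz, the same division gives the candidate inter-burst frequencies that a personalized protocol would consider.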

      Strengths:

      The model (corticothalamic neural population with calcium-dependent plasticity, with TBS input for rTMS) is thoroughly built and analyzed.

      The conclusions seem sound and justified. The authors justifiably link stimulation parameters (especially the alpha subharmonics iTBS frequency) with fluctuations in calcium concentration and their effects on LTP and LTD in relevant parts of the corticothalamic circuit populations leading to a dampening of corticothalamic loop gains and enhancement of intrathalamic gains with an overall circuit-wide feedforward inhibition (= inhibitory activity is enhanced via excitatory inputs onto inhibitory neurons) and a resulting suppression of the activity power. In other words: alpha-resonant iTBS protocols achieve broadband power suppression via selective modulation of corticothalamic FFI.

      (1) The model is well-described, with the model equations in the main text and the parameters in well-formatted tables.

      (2) The relationship between iTBS timing and the phase of rhythms is well explained conceptually.

      (3) Metaplasticity and feedforward inhibition regulation as a driver for the efficacy of iTBS are well explored in the paper.

      (4) Efficacy of TBS, being based on mimicry of endogenous theta patterns, seems well supported by this simulation.

      (5) Recovery between periods of calcium influx as an explanation for why intermittency produces LTP effects where continuous stimulation fails is a good justification for calcium-based metaplasticity, as well as for the role of specific pulse rate.

      (6) Circuit resonance conclusion is interesting as a modulating factor; the paper supports this hypothesis well.

      (7) The analysis of corticothalamic dampening and intrathalamic enhancement in the 3D XYZ loop gain space is a strong aspect of the paper.

      Weaknesses:

      (1) Overall, the paper is difficult to follow narratively - the motivation (formulated as a specific research question) for each section can be a bit unclear. The paper could benefit from a minor rewrite at the start of each section to justify each section's reasoning. The Discussion is too long and should be shortened and limited to the main points.

      (2) While the paper refers to modelling and data in discussion, there is no direct comparison of the simulations in the figures to data or other models, so it's difficult to evaluate directly how well the modelling fits either the existing model space or data from this region. Where exactly the model/plasticity parameters from Table 5 and the NFTsim library come from is not easy to find. The authors should make the link from those parameters to experimental data clearer. For example, which clinical or experimental data are their simulations of the resting-state broadband power suppression based on?

      (3) The figures should be modified to make them more understandable and readable.

      (4) The claim in the abstract that the paper introduces "a novel paradigm for individualizing iTBS treatments" is too strong and sounds like overselling. The paper is not the first computational modelling of TBS, as the authors themselves acknowledge when citing previous mean-field plasticity modelling articles. The authors could also briefly mention, and include references to, biophysically more detailed multi-scale approaches such as https://doi.org/10.1016/j.brs.2021.09.004 and https://doi.org/10.1101/2024.07.03.601851 and https://doi.org/10.1016/j.brs.2018.03.010

      (5) The modelling assumes the same CaDP model/mechanism for all excitatory synapses/afferents. How well is this supported by experimental evidence? Have all excitatory synaptic connections in the cortico-thalamic circuit been shown to express CaDP and metaplasticity? If not, these limitations (or predictions of the model) should be mentioned. Why were LTP calcium volumes never induced within thalamic relay-afferent connections se and sr? What about inhibitory synapses in the circuit model? Were they plastic or fixed?

      (6) Minor point: Metaplasticity is modelled as an activity-dependent shift in NMDAR conductance, which is supported by some evidence, but there are other metaplasticity mechanisms. Altering the NMDA synapse also directly affects the synaptic AMPA/NMDA weight and ratio (which has not been modelled in the paper). Would the model still work with another, more phenomenological implementation of the sliding threshold, e.g. one based on shifting calcium-dependent LTP/LTD windows or thresholds (for a phenomenological model of spike/voltage-based STDP-BCM rules, see https://doi.org/10.1007/s10827-006-0002-x and https://doi.org/10.1371/journal.pcbi.1004588), perhaps using a metaplasticity extension of the Graupner and Brunel CaDP model? A brief discussion of these issues might be added to the manuscript, but this is just a suggestion.

      (7) Short-term plasticity (depression/facilitation) of synapses is neglected in the model. This limitation should be mentioned because adding short-term synaptic dynamics might strongly affect the dynamics of the circuit model.

    3. Reviewer #2 (Public review):

      Transcranial magnetic stimulation is used in several medical conditions to alter brain activity, probably by induction of synaptic plasticity. The authors pursue the idea to personalise parameters of the stimulation protocol by adapting the stimulation frequency to an individual's brain rhythm. The authors test this approach in a population model connecting the cortex with deeper brain areas, the thalamocortical loop, which includes calcium-dependent plasticity for the connections within and between brain regions. While the authors relate literature-based experimental findings with their results, their results are so far not supported by experimental work.

      The authors successfully highlight in their model that personalization of rTMS stimulation frequency to the brain's intrinsic frequency has the potential to improve stimulation impact, and they relate this to specific changes in the network. Their arguments that this resonance improves efficacy are intuitive, and their finding that inhibition and excitation are selectively modulated is a good starting point for analysing the underlying mechanism.

      As rTMS is used in clinical contexts, and the idea of aligning intrinsic and stimulation frequency is relatively easy to implement, the paper is conceptually of interest for the rTMS community, despite its weak points on the mechanistic explanation. The authors made the simulation code publicly available, which is a useful resource for further studies on the effects of metaplasticity. The same stimulation parameters have been tested in experiments, and a reanalysis of the experimental results following the idea of this paper could be influential for clinical optimisation of stimulation protocols.

      A strength of the paper is that it also takes into account deeper brain areas and their interaction with the cortex. The paper carefully measures system changes in response to different frequency differences between the thalamocortical loop and the stimulation. By explicitly modelling changes to connections, the authors do start to dissect the mechanism underlying the observed effect. Unfortunately, the dissection of the mechanistic underpinning in the current version of the manuscript does not yet fully exploit the possibilities of a computational model. Here are a couple of points related to this critique:

      (1) The study reports that connections between thalamus and cortex as well as within the thalamus change, but the model is not used to separate the influence of both.

      (2) The paper reports that a resonance between stimulation and brain increases stimulation effectiveness. This conclusion is solely based on the observation of strong reactions in the network to subharmonics of the brain's frequency, and lacks further support such as alternative measures of resonance, or an analysis of the role of the phase difference between stimulation and brain oscillation, which is likely changed by the stimulation. For example, for harmonic oscillators, resonance leads to a 90 degree phase difference between driving force and system response, and for rTMS, phase locking has been shown to be relevant.

      (3) The authors claim that over-engagement of plasticity for HF-rTMS makes their intermittent protocol more effective. Yet, the study lacks a direct comparison between stimulation protocols that shows over-engagement of plasticity for the HF-protocol. The study also does not explore which time-scale of the plasticity mechanism governs the optimal stimulation protocol. Moreover, the study reports that only a small number of pulses per burst shows a good effect. This should depend on how strongly a single pulse changes the calcium volume, but this relation was not explored in the model.

      (4) The authors report on the frequency spectrum of the cortical excitatory population, with the argument that the power of this population is most closely related to EEG measurements. A report of the other neuronal populations is missing, which might be informative on what is going on in the network.
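
      The phase relation invoked in point (2) can be made explicit for the textbook driven, damped harmonic oscillator. This is a generic worked example, not part of the authors' model; the symbols ω₀ (natural frequency), γ (damping rate), F₀ (drive amplitude) and m (mass) are introduced here purely for illustration.

```latex
\begin{aligned}
\ddot{x} + \gamma\dot{x} + \omega_0^2 x &= \frac{F_0}{m}\cos(\omega t), \\
x(t) &= A(\omega)\cos\bigl(\omega t - \varphi(\omega)\bigr), \\
A(\omega) &= \frac{F_0/m}{\sqrt{(\omega_0^2-\omega^2)^2 + \gamma^2\omega^2}}, \\
\tan\varphi(\omega) &= \frac{\gamma\,\omega}{\omega_0^2-\omega^2}.
\end{aligned}
```

      At ω = ω₀ the denominator of tan φ vanishes, so φ = π/2 and the steady-state response lags the drive by 90 degrees. A phase-lag curve of this kind, measured between the stimulation and the population oscillation, would be one possible alternative measure of resonance.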

      Statistics:

      (1) The authors do not state whether they test for assumptions of the multiple regression analysis, such as whether errors have equal variance or that residuals are normally distributed.

      (2) For the statistical analysis, the authors ignore about half of their model simulations for which the change in the power was negligible. It is not clear to me which statistical analysis is meant: whether the figures show all model simulations, whether regression lines were evaluated ignoring them, and whether the multiple regression analysis used only half of the data points.

    4. Reviewer #3 (Public review):

      Summary:

      This article presented a novel computer model to address an important question in the field of brain stimulation: using the magnetic stimulation iTBS protocol as an example, how do stimulation parameters, frequency in particular, interfere with the intrinsic brain oscillations via plastic mechanisms? Brain oscillation is a critical feature of functional brains, and its alteration signals the onset of many neuropsychiatric diseases or certain brain states. The authors suggested with their model that harmonic and subharmonic stimulations close to the individual alpha frequency achieved strong broadband power suppression.

      Strengths:

      The authors focused on the cortico-thalamic circuitry and managed to generate alpha oscillations in their four-population model. By adding the non-monotonic calcium-based BCM rule, they have also achieved both homeostasis and plasticity in response to magnetic stimulation. This work combined computer simulations and statistical analysis to demonstrate the changes in network architecture and network dynamics triggered by varied magnetic stimulation parameters. By delivering the iTBS protocol to the cortical excitatory population, the key findings are that harmonic and subharmonic stimulations close to the individual alpha frequency (IAF) achieved strong broadband power suppression. This resulted from increased synaptic weights of the corticothalamic feed-forward inhibitory projections, which were mediated by the calcium dynamics perturbed by iTBS magnetic stimulation. This finding endorsed the importance of applying customized stimulation to patients based on their IAFs and suggested the underlying mechanism at the circuitry level.

      Weaknesses:

      The drawbacks of this work are also obvious. Model validation and biological feasibility justification should be better addressed. The primary outcome of their model is the broadband power suppression and the optimal effects of (sub)harmonic stimulation frequency, but it lacks immediate empirical support in the literature. To the best of my knowledge, many alpha-frequency tACS studies reported an increase, not a suppression, in the power of certain brain oscillations. A review by Wang et al., 2024 (Frontiers in Systems Neuroscience) suggested hybrid changes to different brain oscillations by magnetic stimulation. Developing a model to fully capture such changes might be out of the scope of the present study and challenging in the entire field, but it undermines the quality of the present work if not extensively discussed and justified.

      Clarity and reproducibility of the work can be improved. Although it is intriguing to see how the calcium-dependent BCM plasticity mediates such changes, the writing of the methods part is not easy to follow. It was also not clear why only two populations were considered in the thalamus, how the entire network was connected, or how the LTP/LTD threshold alters with calcium dynamics. The figures were unfortunately prepared in a nested manner. The crowded layout and the tiny font sizes reduce the clarity.

      The third point concerns contextualization and comparison to existing models. It would strengthen the work if the authors compared it to other TMS modeling work with plasticity rules, e.g., Anil et al., 2024. Besides, magnetic stimulation is unique in being supra-threshold and having focality compared to other brain stimulation modalities, e.g., tDCS and tACS, but they may share certain basic neural mechanisms if accounting for certain parameters, e.g., frequency. A solid literature review and discussion on this part may help the field better perceive the value and potential limitations of this work.

    1. eLife Assessment

      This study is an important contribution to the field of viral sequencing, providing methods for more accurate characterization of viral genetic diversity using long-read sequencing and unique molecular identifiers (UMIs). Although it is a small pilot study, it shows promise as a convincing, validated methodology with broad applicability.

    2. Reviewer #1 (Public review):

      Tamao et al. aimed to quantify the diversity and mutation rate of influenza virus (PR8 strain) in order to establish a high-resolution method for studying intra-host viral evolution. To achieve this, the authors combined RNA sequencing with single-molecule unique molecular identifiers (UMIs) to minimize errors introduced during technical processing. They proposed an in vitro infection model with a single viral particle to represent biological genetic diversity, alongside a control model using in vitro transcribed RNA for two viral genes, PB2 and HA.

      Through this approach, the authors demonstrated that UMIs reduced technical errors by approximately tenfold. By analyzing four viral populations and comparing them to in vitro transcribed RNA controls, they estimated that ~98.1% of observed mutations originated from viral replication rather than technical artifacts. Their results further showed that most mutations were synonymous and introduced randomly. However, the distribution of mutations suggested selective pressures that favored certain variants. Additionally, comparison with a closely related influenza strain (A/Alaska/1935) revealed two positively selected mutations, though these were absent in the strain responsible for the most recent pandemic (CA01).

      Overall, the study is well-designed, and the interpretations are strongly supported by the data.

      The authors have addressed all the comments from the previous round of reviews. No further concerns.

    3. Reviewer #2 (Public review):

      Summary:

      This manuscript presents a technically oriented application of UMI-based long-read sequencing to study intra-host diversity in influenza virus populations. The authors aim to minimize sequencing artifacts and improve the detection of rare variants, proposing that this approach may inform predictive models of viral evolution. While the methodology appears robust and successfully reduces sequencing error rates, key experimental and analytical details are missing, and the biological insight is modest. The study includes only four samples, with no independent biological replicates or controls, which limits the generalizability of the findings. Claims related to rare variant detection and evolutionary selection are not fully supported by the data presented.

      Strengths:

      The study addresses an important technical challenge in viral genomics by implementing a UMI-based long-read sequencing approach to reduce amplification and sequencing errors. The methodological focus is well presented, and the work contributes to improving the resolution of low-frequency variant detection in complex viral populations.

      Weaknesses:

      The application of UMI-based error correction to viral population sequencing has been established in previous studies (e.g., in HIV), and this manuscript does not introduce a substantial methodological or conceptual advance beyond its use in the context of influenza.

      The study lacks independent biological replicates or additional viral systems that would strengthen the generalizability of the conclusions. Potential sources of technical error are not explored or explicitly controlled. Key methodological details are missing, including the number of PCR cycles, the input number of molecules, and UMI family size distributions. These are essential to support the claimed sensitivity of the method.

      The assertion that variants at ≥0.1% frequency can be reliably detected is based on total read count rather than the number of unique input molecules. Without information on UMI diversity and family sizes, the detection limit cannot be reliably assessed.

      Although genetic variation is described, the functional relevance of observed mutations in HA and NA is not addressed or discussed in the context of known antigenic or evolutionary features of influenza. The manuscript is largely focused on technical performance, with limited exploration of the biological implications or mechanistic insights into influenza virus evolution.

      The experimental scale is small, with only four viral populations derived from single particles analyzed. This limited sample size restricts the ability to draw broader conclusions about quasispecies dynamics or evolutionary pressures.

      Comments on revisions:

      The revised manuscript provides additional methodological detail and clearer presentation, which improves transparency. However, the main limitations persist: the study remains small in scale, lacks independent validation, and relies on theoretical rather than empirical support for its claimed detection sensitivity. As a result, the work represents a modest technical advance rather than a substantive contribution to understanding influenza virus evolution.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      (1) The methods section is overly brief. Even if techniques are cited, more experimental details should be included. For example, since the study focuses heavily on methodology, details such as the number of PCR cycles in RT-PCR or the rationale for choosing HA and PB2 as representative in vitro transcripts should be provided.

      We thank the reviewer for this important suggestion. We have now expanded the Methods section to include the number of PCR cycles used in RT-PCR (line 407) and have explained the rationale for choosing HA and PB2 as representative transcripts (line 388).

      (2) Information on library preparation and sequencing metrics should be included. For example, the total number of reads, any filtering steps, and quality score distributions/cutoff for the analyzed reads.

      We agree and have added detailed information on library preparation, filtering criteria, quality score thresholds, and sequencing statistics for each sample (line 422, Figure S2).

      (3) In the Results section (line 115, "Quantification of error rate caused by RT"), the mutation rate attributed to viral replication is calculated. However, in line 138, it is unclear whether the reported value reflects PB2, HA, or both, and whether the comparison is based on the error rate of the same viral RNA or the mean of multiple values (as shown in Figure 3A). Please clarify whether this number applies universally to all influenza RNAs or provide the observed range.

      We appreciate this point. We have clarified in the Results (line 140) that the reported value corresponds to PB2.

      (4) Since T7 polymerase-introduced errors apply only to the in vitro transcription control, how were these accounted for when comparing mutation rates between transcribed RNA and cell-culture-derived virus?

      We agree that errors introduced by T7 RNA polymerase are present only in the in vitro–transcribed RNA control. However, even when taking this into account, the error rate detected in the in vitro transcripts remained substantially lower than that observed in the viral RNA extracted from replicated virus (line 140, Fig.3a). Thus, the difference cannot be explained by T7-derived errors, and our conclusion regarding the elevated mutation rate in cell-culture–derived viral populations remains valid.

      (5) Figure 2 shows that a UMI group size of 4 has an error rate of zero, but this group size is not mentioned in the text. Please clarify.

      We have revised the Results (line 98) to describe the UMI group size of 4.

      Reviewer #2 (Public review):

      (1) The application of UMI-based error correction to viral population sequencing has been established in previous studies (e.g., HIV), and this manuscript does not introduce a substantial methodological or conceptual advance beyond its use in the context of influenza.

      We appreciate the reviewer’s comment and agree that UMI-based error correction has been applied previously to viral population sequencing, including HIV. However, to our knowledge, relatively few studies have quantitatively evaluated both the performance of this method and the resulting within-quasi-species mutation distributions in detail. In our manuscript, we not only validate the accuracy of UMI-based error correction in the context of influenza virus sequencing, but also quantitatively characterize the features of intra-quasi-species distributions, which provides new insights into the mutational landscape and evolutionary dynamics specific to influenza. We therefore believe that our work goes beyond a simple application of an established method.

      (2) The study lacks independent biological replicates or additional viral systems that would strengthen the generalizability of the conclusions.

      We agree with the reviewer that the lack of independent biological replicates and additional viral systems limits the generalizability of our findings. In this study, we intentionally focused on single-particle–derived populations of influenza virus to establish a proof-of-principle for our sequencing and analytical framework. While this design provided a clear demonstration of the method’s ability to capture mutation distributions at the single-particle level, we acknowledge that additional biological replicates and testing across diverse viral systems would be necessary to confirm the broader applicability of our observations. Importantly, even within this limited framework, our analysis enabled us to draw conclusions at the level of individual viral populations and to suggest the possibility of comparing their mutation distributions with known evolvability. This highlights the potential of our approach to bridge observations from single particles with broader patterns of viral evolution. In future work, we plan to expand the number of populations analyzed and include additional viral systems, which will allow us to more rigorously assess reproducibility and to establish systematic links between mutation accumulation at the single-particle level and evolutionary dynamics across viruses.

      (3) Potential sources of technical error are not explored or explicitly controlled. Key methodological details are missing, including the number of PCR cycles, the input number of molecules, and UMI family size distributions.

      We thank the reviewer for this important suggestion. We have now expanded the Methods section to include the number of PCR cycles used in RT-PCR (line 407). In addition, we have added information on the estimated number of input molecules. Regarding the UMI family size distributions, we have added the data as Figure S2 and referred to it in the revised manuscript.

      Finally, with respect to potential sources of technical error, we note that this point is already addressed in the manuscript by direct comparison with in vitro transcribed RNA controls, which encompass errors introduced throughout the entire experimental process. This comparison demonstrates that the error-correction strategy employed here effectively reduces the impact of PCR or sequencing artifacts.

      (4) The assertion that variants at ≥0.1% frequency can be reliably detected is based on total read count rather than the number of unique input molecules. Without information on UMI diversity and family sizes, the detection limit cannot be reliably assessed.

      We thank the reviewer for raising this important issue. We agree that our original description was misleading, as the reliable detection limit should not be defined solely by total read count. In the revised version, we have added information on UMI distribution and family sizes (Figure S2), and we now state the detection limit in terms of consensus reads. Specifically, we define that variants can be reliably detected when ≥10,000 consensus reads are obtained with a group size of ≥3 (line 173). 

      (5) Although genetic variation is described, the functional relevance of observed mutations in HA and NA is not addressed or discussed.

      We appreciate the reviewer’s suggestion. In our study, we did not apply drug or immune selection pressure; therefore, we did not expect to detect mutations that are already known to cause major antigenic changes in HA or NA, and we think it is difficult to discuss such functional implications in this context. However, as noted in the Discussion, we did identify drug resistance–associated mutations. This observation suggests that the quasi-species pool may provide functional variation, including resistance, even in the absence of explicit selective pressure. We have clarified this point in the text to better address the reviewer’s concern (line 330).

      (6) The experimental scale is small, with only four viral populations derived from single particles analyzed. This limited sample size restricts the ability to draw broader conclusions.

      We thank the reviewer for pointing out the limitation of analyzing only four viral populations derived from single particles. We fully acknowledge that the small sample size restricts the generalizability of our conclusions. Nevertheless, we would like to emphasize that even within this limited dataset, our results consistently revealed a slight but reproducible deviation of the mutation distribution from the Poisson expectation, as well as a weak correlation with inter-strain conservation. These recurring patterns highlight the robustness of our observations despite the sample size.

      In future work, we plan to expand the number of viral populations analyzed and to monitor mutation distributions during serial passage under defined selective pressures. We believe that such expanded analyses will enable us to more reliably assess how mutations accumulate and to develop predictive frameworks for viral evolution.

      Reviewer #1 (Recommendations for the authors):

      (1) Please mention Figure 1 and S2 in the text.

      Done. We now explicitly reference Figures 1 and S2 (renamed to S1 according to appearance order) in the appropriate sections (lines 74, 124).

      (2) In Figure 4A, please specify which graph corresponds to PB2 and which to PB2-like sequences.

      Corrected. The Figure 4A legend now specifies PB2 vs. PB2-like sequences.

      (3) Consider reducing redundancy in lines 74, 149, 170, 214, and 215.

      We thank the reviewer for this stylistic suggestion. We have revised the text to reduce redundancy in these lines.

      Reviewer #2 (Recommendations for the authors):

      (1) The manuscript states that "with 10,000 sequencing reads per gene ...variants at ≥0.1% frequency can be reliably detected." However, this interpretation conflates raw read counts with independent input molecules.

      We have revised this statement throughout the text to clarify that sensitivity depends on the number of unique UMIs rather than raw read counts (line 173). To support this, we calculated the probability of detecting a true variant present at a frequency of 0.1% within a population. When sequencing ≥10,000 unique molecules, such a variant would be observed at least twice with a probability of approximately 99.95%. In contrast, the error rate of in vitro–transcribed RNA, reflecting errors introduced during the experimental process, was estimated to be on the order of 10⁻⁶ (line 140, Fig. 3a). Under this condition, the probability that the same artificial error would arise independently at the same position in two out of 10,000 molecules is <0.5%. Therefore, variants present at ≥0.1% can be reliably distinguished from technical artifacts and are confidently detected under our sequencing conditions.
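
      The two probabilities quoted in this response can be checked with a short binomial calculation. The sketch below is an illustration only, not the authors' code. It assumes that each of n unique molecules carries the variant independently with probability p, and it treats the 10⁻⁶ error rate as a per-position, per-molecule probability. The function name prob_at_least_two is hypothetical.

```python
def prob_at_least_two(n: int, p: float) -> float:
    """P(X >= 2) for X ~ Binomial(n, p), computed as 1 - P(X = 0) - P(X = 1)."""
    p_zero = (1.0 - p) ** n
    p_one = n * p * (1.0 - p) ** (n - 1)
    return 1.0 - p_zero - p_one


n_molecules = 10_000  # unique (UMI-collapsed) molecules, as stated in the response

# True variant present in the population at 0.1% frequency.
p_variant = prob_at_least_two(n_molecules, 1e-3)

# Assumption: technical error rate of 1e-6 per position per molecule.
p_artifact = prob_at_least_two(n_molecules, 1e-6)

print(f"variant seen at least twice:  {p_variant:.4%}")
print(f"artifact seen at least twice: {p_artifact:.4%}")
```

      Under these assumptions the first probability comes out at about 99.95%, matching the figure quoted above, and the second falls well below the stated 0.5% bound.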

      (2) To support the claimed sensitivity, please provide for each gene and population: (a) UMI family size distributions, (b) number of PCR cycles and input molecule counts, and (c) recalculation of the detection limit based on unique molecules.

      If possible, I encourage experimental validation of sensitivity claims, such as spike-in controls at known variant frequencies, dilution series, or technical replicates to demonstrate reproducibility at the 0.1% detection level.

      We have added (a) histograms of UMI family size distributions for each gene and population (Figure S2), (b) a detailed RT-PCR protocol and estimated input counts (line 407), and (c) recalculated detection limits (line 173).

      We appreciate the reviewer’s suggestion and fully recognize the value of spike-in experiments. However, given the observed mutation rate of T7-derived RNA and the sufficient sequencing depth in our dataset, it is evident that variants above the 0.1% threshold can be robustly detected without additional spike-in controls.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      Summary:

      The aim of this paper is to develop a simple method to quantify fluctuations in the partitioning of cellular elements. In particular, the authors propose a flow-cytometry-based method coupled with a simple mathematical theory as an alternative to conventional imaging-based approaches.

      Strengths:

      The approach they develop is simple to understand and its use with flow-cytometry measurements is clearly explained. Understanding how fluctuations in cytoplasm partitioning vary across different kinds of cells is particularly interesting.

      Weaknesses:

      The theory only considers fluctuations due to cellular division events. Fluctuations in cellular components are largely affected by various intrinsic and extrinsic sources of noise and only under particular conditions does partitioning noise become the dominant source of noise. In the revised version of the manuscript, they argue that in their setup, noise due to production and degradation processes is negligible but noise due to extrinsic sources such as those stemming from cell-cycle length variability may still be important. To investigate the robustness of their modelling approach to such noise, they simulated cells following a sizer-like division strategy, a scenario that maximizes the coupling between fluctuations in cell-division time and partitioning noise. They find that estimates remain within the pre-established experimental error margin.

      We thank the Reviewer for her/his work in revising our manuscript.

      Reviewer #2 (Public review):

      Summary:

      The authors present a combined experimental and theoretical workflow to study partitioning noise arising during cell division. Such quantifications usually require time-lapse experiments, which are limited in throughput. To bypass these limitations, the authors propose to use flow-cytometry measurements instead and analyse them using a theoretical model of partitioning noise. The problem considered by the authors is relevant and the idea to use statistical models in combination with flow cytometry to boost statistical power is elegant. The authors demonstrate their approach using experimental flow cytometry measurements and validate their results using time-lapse microscopy. The approach focuses on a particular case, where the dynamics of the labelled component depends predominantly on partitioning, while turnover of components is not taken into account. The description of the methods is significantly clearer than in the previous version of the manuscript.

      We thank the Reviewer for her/his work in revising our manuscript. In the following, we address the remaining raised points.

      I have only two comments left:

      • In eq. (1) the notation has been changed/corrected, but the text immediately after it still refers to the old notation.

      We have fixed the notation.

      • Maybe I don't fully understand the reasoning provided by the authors, but it is still not entirely clear to me why microscopy-based estimates are expected to be larger. Fewer samples will increase the estimation uncertainty, but this can go either way in terms of the inferred variability.

      We thank the Reviewer for giving us the opportunity to clarify this point. In the previous answer, we focused on the role of the gating strategy, highlighting how the limited statistics available with microscopy reduce the chances of a stronger selection of the events. The explanation for why the noise is biased toward increasing the estimation of division asymmetry relies on multiple aspects: First, due to the multiple sources of noise affecting fluorescence intensity, the experimental procedure, and the segmentation protocol, the measurements of the fluorescence intensity of single cells fluctuate. This variability adds to the inherent stochasticity of the partitioning process, thereby increasing the overall variance of the distribution.

      To illustrate this effect, we simulated the microscopy data. We drew a fraction f from a Gaussian distribution with mean µ = p and standard deviation σ = σ_true, i.e., N(p, σ_true). We then simulated different time frames by adding to f noise drawn from a Gaussian distribution with mean µ = 0 and standard deviation σ = σ_noise, i.e., N(0, σ_noise). The same procedure was applied to 1 − f. The added noise was resampled so that the two measurements remained independent. Figure 6 shows a sample dynamic where the empty gray circles represent the true fractions. We then fitted the two dynamics to a linear equation with a common slope and obtained an estimate of the partitioning noise.

      By repeating this process a number of times consistent with the experiment, we measured the resulting standard deviation of the new partitioning distribution. Figure 7 shows the distribution of the measured standard deviation over multiple repetitions of the simulations. Each histogram shows the standard deviation of the partitioning distribution obtained from 100 simulations of the noisy (and non-noisy) fluorescence dynamics. By comparing this with the distribution of the standard deviation of the non-noisy dynamics, it is possible to observe that, on average, the added noise leads to a greater estimated variance. The magnitude of this increase depends on the variance of the added noise, but it is always biased toward larger values.

      This represents only one component of the effect. The shown distributions and simulations are intended solely to demonstrate the direction of the bias, and not to account for the exact difference between the flow cytometry and microscopy estimates. In the proposed case, where noise and true variance are equal, the resulting difference in division asymmetry is 1.3.
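
      The simulation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for Figures 6 and 7: the true fractions are taken to be constant across frames, the inherited fraction is recovered from the two fitted intercepts as a1 / (a1 + a2), and the function names, the number of frames, and the parameter values are hypothetical.

```python
import random
import statistics


def fit_common_slope(t, y1, y2):
    """Least-squares fit of y1 = a1 + b*t and y2 = a2 + b*t with a shared slope b."""
    t_bar = statistics.fmean(t)
    y1_bar, y2_bar = statistics.fmean(y1), statistics.fmean(y2)
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    # Pooled covariance between time and the two group-centred traces.
    sxy = sum((ti - t_bar) * ((u - y1_bar) + (v - y2_bar))
              for ti, u, v in zip(t, y1, y2))
    slope = sxy / (2.0 * sxx)
    return y1_bar - slope * t_bar, y2_bar - slope * t_bar, slope


def estimated_sd(p, sigma_true, sigma_noise, n_events, n_frames, rng):
    """Standard deviation of the inherited fraction estimated from noisy traces."""
    t = list(range(n_frames))
    estimates = []
    for _ in range(n_events):
        f = rng.gauss(p, sigma_true)  # true fraction for this division
        # Independent measurement noise is added to each daughter at each frame.
        y1 = [f + rng.gauss(0.0, sigma_noise) for _ in t]
        y2 = [(1.0 - f) + rng.gauss(0.0, sigma_noise) for _ in t]
        a1, a2, _slope = fit_common_slope(t, y1, y2)
        estimates.append(a1 / (a1 + a2))  # assumed estimator of the fraction
    return statistics.stdev(estimates)


if __name__ == "__main__":
    rng = random.Random(0)
    # Illustrative values only: sigma_noise equal to sigma_true, 5 frames.
    noisy = [estimated_sd(0.5, 0.05, 0.05, 100, 5, rng) for _ in range(200)]
    clean = [estimated_sd(0.5, 0.05, 0.0, 100, 5, rng) for _ in range(200)]
    print("mean SD with noise:   ", statistics.fmean(noisy))
    print("mean SD without noise:", statistics.fmean(clean))
```

      Because the measurement noise is independent of the true fraction, its variance adds to the true partitioning variance, so the estimated spread is larger on average whenever σ_noise > 0.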

      A second contribution arises from the segmentation protocol. As we stated, a major limitation of the microscopy-based approach is the need for manual image segmentation. This reduces the amount of available data and introduces potential errors. Even though different checks were applied, some situations are difficult to avoid. For example, when daughter cells are very close to each other, the borders may not be precisely recognized; cells may overlap; or speckles may remain undetected. In all these cases, it is easier to overestimate the fluorescence than to underestimate it, thereby increasing the chance of an extremal event.

      Indeed, segmentation relies on both brightfield and fluorescence images. Errors in defining the cell outline are more likely when fluorescence is low, since borders, overlaps, and speckles are more evident against a darker background. This introduces an additional bias toward higher asymmetry, increasing the number of events in the tail of the partitioning distribution.

      Both aspects described above could be mitigated by increasing the available statistics. In particular, by applying stricter selection criteria, such as imposing limits on fluorescence intensity fluctuations, the distribution should approach the expected one.

      A similar issue does not arise in flow cytometry experiments. From the initial sorting procedure, which ensures a cleaner separation of peaks, to the morphological checks performed at each acquisition point, the availability of a large number of measured events reduces both measurement noise and segmentation errors.

      A discussion on these aspects has been added in the revised version of the Supplementary Materials and in the Main Text.

    2. Reviewer #2 (Public review):

      The authors present a combined experimental and theoretical workflow to study partitioning noise arising during cell division. Such quantifications usually require time-lapse experiments, which are limited in throughput. To bypass these limitations, the authors propose to use flow-cytometry measurements instead and analyse them using a theoretical model of partitioning noise. The problem considered by the authors is relevant and the idea to use statistical models in combination with flow cytometry to boost statistical power is elegant. The authors demonstrate their approach using experimental flow cytometry measurements and validate their results using time-lapse microscopy. The approach focuses on a particular case, where the dynamics of the labelled component depends predominantly on partitioning, while turnover of components is not taken into account. The description of the methods is significantly clearer than in the previous version of the manuscript.

    3. Reviewer #1 (Public review):

      Summary:

      The aim of this paper is to develop a simple method to quantify fluctuations in the partitioning of cellular elements. In particular, they propose a flow-cytometry based method coupled with a simple mathematical theory as an alternative to conventional imaging-based approaches.

      Strengths:

      The approach they develop is simple to understand, and its use with flow-cytometry measurements is clearly explained. Understanding how the fluctuations in cytoplasm partitioning vary across different kinds of cells is particularly interesting.

      Weaknesses:

      The theory only considers fluctuations due to cellular division events. Fluctuations in cellular components are largely affected by various intrinsic and extrinsic sources of noise, and only under particular conditions does partitioning noise become the dominant source of noise. In the revised version of the manuscript, they argue that in their setup noise due to production and degradation processes is negligible, but noise due to extrinsic sources, such as cell-cycle length variability, may still be important. To investigate the robustness of their modelling approach to such noise, they simulated cells following a sizer-like division strategy, a scenario that maximizes the coupling between fluctuations in cell-division time and partitioning noise. They find that estimates remain within the pre-established experimental error margin.

      Comments on previous version:

      The authors have addressed all of my comments.

    4. eLife Assessment

      This study presents a useful method based on flow cytometry to study partitioning noise during cell division. The evidence from the methods, data and analysis supporting the authors' claims is convincing. This work will be of interest to cell biologists and biophysicists working on asymmetric partitioning during cell division.

    1. eLife Assessment

      This important study combines brain stimulation with fMRI and behavioural modelling to probe the role of the left superior frontal sulcus in perceptual and value-based decision making. The evidence that the left SFS plays a key role in perceptual decision making is convincing; the results also suggest that the value-based decision process was largely unaffected by the stimulation, despite a change in response times.

    2. Reviewer #1 (Public review):

      Summary:

      In this study, participants completed two different tasks: a perceptual choice task in which they compared the sizes of pairs of items, and a value-difference task in which they identified the higher-value option among pairs of items. The two tasks involved the same stimuli. Based on previous fMRI research, the authors sought to determine whether the superior frontal sulcus (SFS) is involved in both perceptual and value-based decisions or just one or the other. Initial fMRI analyses were devised to isolate brain regions that were activated for both types of choices and also regions that were unique to each. Transcranial magnetic stimulation was applied to the SFS in between fMRI sessions and it was found to lead to a significant decrease in accuracy and RT on the perceptual choice task but only a decrease in RT on the value-difference task. Hierarchical drift diffusion modelling of the data indicated that the TMS had led to a lowering of decision boundaries in the perceptual task and a lowering of non-decision times on the value-based task. Additional analyses show that SFS covaries with model-derived estimates of cumulative evidence and that this relationship is weakened by TMS.

      Strengths:

      The paper has many strengths, including the rigorous multi-pronged approach of causal manipulation, fMRI and computational modelling, which offers a fresh perspective on the neural drivers of decision making. Some additional strengths include the careful paradigm design, which ensured that the two types of tasks were matched for their perceptual content while orthogonalizing trial-to-trial variations in choice difficulty. The paper also lays out a number of specific hypotheses at the outset regarding the behavioural outcomes that are tied to decision model parameters and are well justified.

      Weaknesses:

      In my previous comments (1.3.1 and 1.3.2) I noted that key results could potentially be explained by cTBS leading to faster perceptual decision making in both the perceptual and value-based tasks. The authors responded that if this were the case then we would expect either a reduction in NDT in both tasks or a reduction in decision boundaries in both tasks (whereas they observed a lowering of boundaries in the perceptual task and a shortening of NDT in the value task). I disagree with this statement. First, it is important to note that the perceptual decision that must be completed before the value-based choice process can even be initiated (i.e. the identification of the two stimuli) is no less trivial than that involved in the perceptual choice task (comparison of stimulus size). Given that the perceptual choice must be completed before the value comparison can begin, it would be expected that the model would capture any variations in RT due to the perceptual choice in the NDT parameter and not, as the authors suggest, in the bound or drift rate parameters, since they are designed to account for the strength and final quantity of value evidence specifically. If, in fact, cTBS causes a general lowering of decision boundaries for perceptual decisions (and hence speeding of RTs) then it would be predicted that this would manifest as a shorter NDT in the value task model, which is what the authors see.

    3. Reviewer #2 (Public review):

      Summary:

      The authors set out to test whether a TMS-induced reduction in excitability of the left Superior Frontal Sulcus influenced evidence integration in perceptual and value-based decisions. They directly compared behaviour (including fits to a computational decision process model) and fMRI pre and post TMS in one of each type of decision-making task. Their goal was to test domain-specific theories of the prefrontal cortex by examining whether the proposed role of the SFS in evidence integration was selective for perceptual but not value-based evidence.

      Strengths:

      The paper presents multiple credible sources of evidence for the role of the left SFS in perceptual decision making, finding similar mechanisms to prior literature and a nuanced discussion of where they diverge from prior findings. The value-based and perceptual decision-making tasks were carefully matched in terms of stimulus display and motor response, making their comparison credible.

      Weaknesses:

      -I was confused about the model specification in terms of the relationship between evidence level and drift rate. The methods (and e.g. supplementary figure 3) specify a linear relationship between evidence level and drift rate, suggesting, unless I misunderstood, that only a single drift rate parameter (kappa) is fit. However, the drift rate parameter estimates in the supplementary tables (and response to reviewers) do not scale linearly with evidence level.

      -The fit quality for the value-based decision task is not as good as that for the PDM, and this would be worth commenting on in the paper.

    4. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this study, participants completed two different tasks: a perceptual choice task in which they compared the sizes of pairs of items, and a value-difference task in which they identified the higher-value option among pairs of items. The two tasks involved the same stimuli. Based on previous fMRI research, the authors sought to determine whether the superior frontal sulcus (SFS) is involved in both perceptual and value-based decisions or just one or the other. Initial fMRI analyses were devised to isolate brain regions that were activated for both types of choices and also regions that were unique to each. Transcranial magnetic stimulation was applied to the SFS in between fMRI sessions and it was found to lead to a significant decrease in accuracy and RT on the perceptual choice task but only a decrease in RT on the value-difference task. Hierarchical drift-diffusion modelling of the data indicated that the TMS had led to a lowering of decision boundaries in the perceptual task and a lowering of non-decision times on the value-based task. Additional analyses show that SFS covaries with model-derived estimates of cumulative evidence and that this relationship is weakened by TMS.

      Strengths:

      The paper has many strengths including the rigorous multi-pronged approach of causal manipulation, fMRI and computational modelling which offers a fresh perspective on the neural drivers of decision making. Some additional strengths include the careful paradigm design which ensured that the two types of tasks were matched for their perceptual content while orthogonalizing trial-to-trial variations in choice difficulty. The paper also lays out a number of specific hypotheses at the outset regarding the behavioural outcomes that are tied to decision model parameters and are well justified.

      Weaknesses:

      (1.1) Unless I have missed it, the SFS does not actually appear in the list of brain areas significantly activated by the perceptual and value tasks in Supplementary Tables 1 and 2. Its presence or absence from the list of significant activations is not mentioned by the authors when outlining these results in the main text. What are we to make of the fact that it is not showing significant activation in these initial analyses?

      You are right that the left SFS does not appear in our initial task-level contrasts. Those first analyses were deliberately agnostic to evidence accumulation (i.e., average BOLD by task, irrespective of trial-by-trial evidence). Consistent with prior work, SFS emerges only when we model the parametric variation in accumulated perceptual evidence.

      Accordingly, we ran a second-level GLM that included trial-wise accumulated evidence (aE) as a parametric modulator. In that analysis, the left SFS shows significant aE-related activity specifically during perceptual decisions, but not during value-based decisions (SVC in a 10-mm sphere around x = −24, y = 24, z = 36).

      To avoid confusion, we now:

      (i) explicitly separate and label the two analysis levels in the Results; (ii) state up front that SFS is not expected to appear in the task-average contrast; and (iii) add a short pointer that SFS appears once aE is included as a parametric modulator. We also edited Methods to spell out precisely how aE is constructed and entered into GLM2. This should make the logic of the two-stage analysis clearer and aligns the manuscript with the literature where SFS typically emerges only in parametric evidence models.

      (1.2) The value difference task also requires identification of the stimuli, and therefore perceptual decision-making. In light of this, the initial fMRI analyses do not seem terribly informative for the present purposes as areas that are activated for both types of tasks could conceivably be specifically supporting perceptual decision-making only. I would have thought brain areas that are playing a particular role in evidence accumulation would be best identified based on whether their BOLD response scaled with evidence strength in each condition which would make it more likely that areas particular to each type of choice can be identified. The rationale for the authors' approach could be better justified.

      We agree that both tasks require early sensory identification of the items, but the decision-relevant evidence differs by design (size difference vs. value difference), and our modelling is targeted at the evidence integration stage rather than initial identification.

      To address your concern empirically, we: (i) added session-wise plots of mean RTs showing a general speed-up across the experiment (now in the Supplement); (ii) fit a hierarchical DDM to jointly explain accuracy and RT. The DDM dissociates decision time (evidence integration) from non-decision time (encoding/response execution).

      After cTBS, perceptual decisions show a selective reduction of the decision boundary (lower accuracy, faster RTs; no drift-rate change), whereas value-based decisions show no change to boundary/drift but a decrease in non-decision time, consistent with faster sensorimotor processing or task familiarity. Thus, the TMS effect in SFS is specific to the criterion for perceptual evidence accumulation, while the RT speed-up in the value task reflects decision-irrelevant processes. We now state this explicitly in the Results and add the RT-by-run figure for transparency.

      (1.2.1) The value difference task also requires identification of the stimuli, and therefore perceptual decision-making. In light of this, the initial fMRI analyses do not seem terribly informative for the present purposes as areas that are activated for both types of tasks could conceivably be specifically supporting perceptual decision-making only.

      Thank you for prompting this clarification.

      The key point is what changes with cTBS. If SFS supported generic identification, we would expect parallel cTBS effects on drift rate (or boundary) in both tasks. Instead, we find: (a) boundary decreases selectively in perceptual decisions (consistent with SFS setting the amount of perceptual evidence required), and (b) non-decision time decreases selectively in the value task (consistent with speed-ups in encoding/response stages). Moreover, trial-by-trial SFS BOLD predicts perceptual accuracy (controlling for evidence), and neural-DDM model comparison shows SFS activity modulates boundary, not drift, during perceptual choices.

      Together, these converging behavioral, computational, and neural results argue that SFS specifically supports the criterion for perceptual evidence accumulation rather than generic visual identification.

      (1.2.2) I would have thought brain areas that are playing a particular role in evidence accumulation would be best identified based on whether their BOLD response scaled with evidence strength in each condition which would make it more likely that areas particular to each type of choice can be identified. The rationale for the authors' approach could be better justified.

      We now more explicitly justify the two-level fMRI approach. The task-average contrast addresses which networks are generally more engaged by each domain (e.g., posterior parietal for PDM; vmPFC/PCC for VDM), given identical stimuli and motor outputs. This complements, but does not substitute for, the parametric evidence analysis, which is where one expects accumulation-related regions such as SFS to emerge. We added text clarifying that the first analysis establishes domain-specific recruitment at the task level, whereas the second isolates evidence-dependent signals (aE) and reveals that left SFS tracks accumulated evidence only for perceptual choices. We also added explicit references to the literature using similar two-step logic and noted that SFS typically appears only in parametric evidence models.

      (1.3) TMS led to reductions in RT in the value-difference as well as the perceptual choice task. DDM modelling indicated that in the case of the value task, the effect was attributable to reduced non-decision time which the authors attribute to task learning. The reasoning here is a little unclear.

      (1.3.1) Comment: If task learning is the cause, then why are similar non-decision time effects not observed in the perceptual choice task?

      Great point. The DDM addresses exactly this: RT comprises decision time (DT) plus non-decision time (nDT). With cTBS, PDM shows reduced DT (via a lower boundary) but stable nDT; VDM shows reduced nDT with no change to boundary/drift. Hence, the superficially similar RT speed-ups in both tasks are explained by different latent processes: decision-relevant in PDM (lower criterion → faster decisions, lower accuracy) and decision-irrelevant in VDM (faster encoding/response). We added explicit language and a supplemental figure showing RT across runs, and we clarified in the text that only the PDM speed-up reflects a change to evidence integration.
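
      The decomposition RT = DT + nDT can be made concrete with the standard closed-form predictions for an unbiased two-boundary drift-diffusion process. The sketch below is illustrative only: it is not the hierarchical model fitted in the manuscript, the function name `ddm_predictions` is hypothetical, and the drift, boundary, and nDT values are arbitrary numbers chosen to show the direction of each effect.

```python
import math


def ddm_predictions(drift, boundary, ndt, noise=1.0):
    """Accuracy and mean RT for an unbiased DDM (start point at boundary / 2).

    drift    : mean rate of evidence accumulation (must be non-zero)
    boundary : separation between the two decision boundaries
    ndt      : non-decision time (encoding and response execution), in seconds
    noise    : within-trial diffusion coefficient (conventionally fixed)
    """
    k = drift * boundary / noise ** 2
    accuracy = 1.0 / (1.0 + math.exp(-k))
    mean_decision_time = (boundary / (2.0 * drift)) * math.tanh(k / 2.0)
    return accuracy, mean_decision_time + ndt


if __name__ == "__main__":
    baseline = ddm_predictions(drift=1.0, boundary=2.0, ndt=0.4)
    lower_boundary = ddm_predictions(drift=1.0, boundary=1.5, ndt=0.4)
    shorter_ndt = ddm_predictions(drift=1.0, boundary=2.0, ndt=0.3)
    for label, (acc, rt) in [("baseline", baseline),
                             ("lower boundary", lower_boundary),
                             ("shorter nDT", shorter_ndt)]:
        print(f"{label:>15}: accuracy = {acc:.3f}, mean RT = {rt:.3f} s")
```

      In this sketch, lowering the boundary shortens mean RT and reduces accuracy, whereas shortening nDT shortens mean RT by the same amount on every trial and leaves accuracy untouched. That is the qualitative contrast drawn above between PDM and VDM.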

      (1.3.2) Given that the value-task actually requires perceptual decision-making, is it not possible that SFS disruption impacted the speed with which the items could be identified, hence delaying the onset of the value-comparison choice?

      We agree there is a brief perceptual encoding phase at the start of both tasks. If cTBS impaired visual identification per se, we would expect longer nDT in both tasks or a decrease in drift rate. Instead, nDT decreases in the value task and is unchanged in the perceptual task; drift is unchanged in both. Thus, cTBS over SFS does not slow identification; rather, it lowers the criterion for perceptual accumulation (PDM) and, separately, we observe faster non-decision components in VDM (likely familiarity or motor preparation). We added a clarifying sentence noting that item identification was easy and highly overlearned (static, large food pictures), and we cite that nDT is the appropriate locus for identification effects in the DDM framework; our data do not show the pattern expected of impaired identification.

      (1.4) The sample size is relatively small. The authors state that 20 subjects is 'in the acceptable range' but it is not clear what is meant by this.

      We have clarified what we mean and provided citations. The sample (n = 20) matches or exceeds many prior causal TMS/fMRI studies targeting perceptual decision circuitry (e.g., Philiastides et al., 2011; Rahnev et al., 2016; Jackson et al., 2021; van der Plas et al., 2021; Murd et al., 2021). Importantly, we (i) use within-subject, pre/post cTBS differences-in-differences with matched tasks; (ii) estimate hierarchical models that borrow strength across participants; and (iii) converge across behavior, latent parameters, regional BOLD, and connectivity. We now replace the vague phrase with a concrete statement and references, and we report precision (HDIs/SEs) for all main effects.

      Reviewer #2 (Public Review):

      Summary:

      The authors set out to test whether a TMS-induced reduction in excitability of the left Superior Frontal Sulcus influenced evidence integration in perceptual and value-based decisions. They directly compared behaviour - including fits to a computational decision process model - and fMRI pre and post-TMS in one of each type of decision-making task. Their goal was to test domain-specific theories of the prefrontal cortex by examining whether the proposed role of the SFS in evidence integration was selective for perceptual but not value-based evidence.

      Strengths:

      The paper presents multiple credible sources of evidence for the role of the left SFS in perceptual decision-making, finding similar mechanisms to prior literature and a nuanced discussion of where they diverge from prior findings. The value-based and perceptual decision-making tasks were carefully matched in terms of stimulus display and motor response, making their comparison credible.

      Weaknesses:

      (2.1) More information on the task and details of the behavioural modelling would be helpful for interpreting the results.

      Thank you for this request for clarity. In the revision we explicitly state, up front, how the two tasks differ and how the modelling maps onto those differences.

      (1) Task separability and “evidence.” We now define task-relevant evidence as size difference (SD) for perceptual decisions (PDM) and value difference (VD) for value-based decisions (VDM). Stimuli and motor mappings are identical across tasks; only the evidence to be integrated changes.

      (2) Behavioural separability that mirrors task design. As reported, mixed-effects regressions show PDM accuracy increases with SD (β=0.560, p<0.001) but not VD (β=0.023, p=0.178), and PDM RTs shorten with SD (β=−0.057, p<0.001) but not VD (β=0.002, p=0.281). Conversely, VDM accuracy increases with VD (β=0.249, p<0.001) but not SD (β=0.005, p=0.826), and VDM RTs shorten with VD (β=−0.016, p=0.011) but not SD (β=−0.003, p=0.419).

      (3) How the HDDM reflects this. The hierarchical DDM fits the joint accuracy–RT distributions with task-specific evidence (SD or VD) as the predictor of drift. The model separates decision time from non-decision time (nDT), which is essential for interpreting the different RT patterns across tasks without assuming differences in the accumulation process when accuracy is unchanged.

      These clarifications are integrated in the Methods (Experimental paradigm; HDDM) and in Results (“Behaviour: validity of task-relevant pre-requisites” and “Modelling: faster RTs during value-based decisions is related to non-decision-related sensorimotor processes”).

      (2.2) The evidence for a choice and 'accuracy' of that choice in both tasks was determined by a rating task that was done in advance of the main testing blocks (twice for each stimulus). For the perceptual decisions, this involved asking participants to quantify a size metric for the stimuli, but the veracity of these ratings was not reported, nor was the consistency of the value-based ones. It is my understanding that the size ratings were used to define the amount of perceptual evidence in a trial, rather than the true size differences, and without seeing more data the reliability of this approach is unclear. More concerning was the effect of 'evidence level' on behaviour in the value-based task (Figure 3a). While the 'proportion correct' increases monotonically with the evidence level for the perceptual decisions, for the value-based task it increases from the lowest evidence level and then appears to plateau at just above 80%. This difference in behaviour between the two tasks brings into question the validity of the DDM which is used to fit the data, which assumes that the drift rate increases linearly in proportion to the level of evidence.

      We thank the reviewer for raising these concerns, and we address each of them point by point:

      (2.2.1) It is my understanding that the size ratings were used to define the amount of perceptual evidence in a trial, rather than the true size differences, and without seeing more data the reliability of this approach is unclear.

      That is correct—we used participants’ area/size ratings to construct perceptual evidence (SD).

      To validate this choice, we compared those ratings against an objective image-based size measure (proportion of non-black pixels within the bounding box). As shown in Author response image 2, perceptual size ratings are highly correlated with objective size across participants (Pearson r values predominantly ≈0.8 or higher; all p<0.001). Importantly, value ratings do not correlate with objective size (Author response image 1), confirming that the two rating scales capture distinct constructs. These checks support using participants’ size ratings as the participant-specific ground truth for defining SD in the PDM trials.
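
      For concreteness, the objective size measure and its correlation with ratings can be sketched as follows. This is a hedged illustration rather than the analysis code: the pixel representation (rows of RGB tuples already cropped to the bounding box), the `threshold` parameter, and both function names are assumptions introduced here.

```python
import math


def nonblack_fraction(pixels, threshold=0):
    """Proportion of pixels in the bounding box with any channel above `threshold`.

    `pixels` is a list of rows, each row a list of (r, g, b) tuples.
    """
    total = 0
    nonblack = 0
    for row in pixels:
        for rgb in row:
            total += 1
            if max(rgb) > threshold:
                nonblack += 1
    return nonblack / total


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

      Per participant, one would compute `nonblack_fraction` for each food image and pass the resulting list, together with that participant's averaged ratings for the same items, to `pearson_r`.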

      Author response image 1.

      Objective size and value ratings are unrelated. Scatterplots show, for each participant, the correlation between objective image size (x-axis; proportion of non-black pixels within the item box) and value-based ratings (y-axis; 0–100 scale). Each dot is one food item (ratings averaged over the two value-rating repetitions). Across participants, value ratings do not track objective size, confirming that value and size are distinct constructs.

      Author response image 2.

      Perceptual size ratings closely track objective size. Scatterplots show, for each participant, the correlation between objective image size (x-axis) and perceptual area/size ratings (y-axis; 0–100 scale). Each dot is one food item (ratings averaged over the two perceptual ratings). Perceptual ratings are strongly correlated with objective size for nearly all participants (see main text), validating the use of these ratings to construct size-difference evidence (SD).

      (2.2.2) More concerning was the effect of 'evidence level' on behaviour in the value-based task (Figure 3a). While the 'proportion correct' increases monotonically with the evidence level for the perceptual decisions, for the value-based task it increases from the lowest evidence level and then appears to plateau at just above 80%. This difference in behaviour between the two tasks brings into question the validity of the DDM which is used to fit the data, which assumes that the drift rate increases linearly in proportion to the level of evidence.

      We agree that accuracy appears to asymptote in VDM, but the DDM fits indicate that the drift rate still increases monotonically with evidence in both tasks. In Supplementary figure 11, drift (δ) rises across the four evidence levels for PDM and for VDM (panels showing all data and pre/post-TMS). The apparent plateau in proportion correct during VDM reflects higher choice variability at stronger preference differences, not a failure of the drift–evidence mapping. Crucially, the model captures both the accuracy patterns and the RT distributions (see posterior predictive checks in Supplementary figures 11-16), indicating that a monotonic evidence–drift relation is sufficient to account for the data in each task.

      Author response image 3.

      HDDM parameters by evidence level. Group-level posterior means (± posterior SD) for drift (δ), boundary (α), and non-decision time (τ) across the four evidence levels, shown (a) collapsed across TMS sessions, (b) for PDM (blue) pre- vs post-TMS (light vs dark), and (c) for VDM (orange) pre- vs post-TMS. Crucially, drift increases monotonically with evidence in both tasks, while TMS selectively lowers α in PDM and reduces τ in VDM (see Supplementary Tables for numerical estimates).

      (2.3) The paper provides very little information on the model fits (no parameter estimates, goodness of fit values or simulated behavioural predictions). The paper finds that TMS reduced the decision bound for perceptual decisions but only affected non-decision time for value-based decisions. It would aid the interpretation of this finding if the relative reliability of the fits for the two tasks was presented.

      We appreciate the suggestion and have made the quantitative fit information explicit:

      (1) Parameter estimates. Group-level means/SDs for drift (δ), boundary (α), and nDT (τ) are reported for PDM and VDM overall, by evidence level, pre- vs post-TMS, and per subject (see Supplementary Tables 8-11).

      (2) Goodness of fit and predictive adequacy. DIC values accompany each fit in the tables. Posterior predictive checks demonstrate close correspondence between simulated and observed accuracy and RT distributions overall, by evidence level, and across subjects (Supplementary Figures 11-16).

      Together, these materials document that the HDDM provides reliable fits in both tasks and accurately recovers the qualitative and quantitative patterns that underlie our inferences (reduced α for PDM only; selective τ reduction in VDM).

      (2.4) Behaviourally, the perceptual task produced decreased response times and accuracy post-TMS, consistent with a reduced bound and consistent with some prior literature. Based on the results of the computational modelling, the authors conclude that RT differences in the value-based task are due to task-related learning, while those in the perceptual task are 'decision relevant'. It is not fully clear why there would be such significantly greater task-related learning in the value-based task relative to the perceptual one. And if such learning is occurring, could it potentially also tend to increase the consistency of choices, thereby counteracting any possible TMS-induced reduction of consistency?

      Thank you for pointing out the need for a clearer framing. We have removed the speculative label “task-related learning” and now describe the pattern strictly in terms of the HDDM decomposition and neural results already reported:

      (1) VDM: Post-TMS RTs are faster while accuracy is unchanged. The HDDM attributes this to a selective reduction in non-decision time (τ), with no change in decision-relevant parameters (α, δ) for VDM (see Supplementary Figure 11 and Supplementary Tables). Consistent with this, left SFS BOLD is not reduced for VDM, and trialwise SFS activity does not predict VDM accuracy—both observations argue against a change in VDM decision formation within left SFS.

      (2) PDM: Post-TMS accuracy decreases and RTs shorten, which the HDDM captures as a lower decision boundary (α) with no change in drift (δ). Here, left SFS BOLD scales with accumulated evidence and decreases post-TMS, and trialwise SFS activity predicts PDM accuracy, all consistent with a decision-relevant effect in PDM.

      Regarding the possibility that faster VDM RTs should increase choice consistency: empirically, consistency did not change in VDM, and the HDDM finds no decision-parameter shifts there. Thus, there is no hidden counteracting increase in VDM accuracy that could mask a TMS effect—the absence of a VDM accuracy change is itself informative and aligns with the modelling and fMRI.
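
      The dissociation described in (1) and (2) can be illustrated with a minimal simulation sketch. It assumes a basic two-boundary drift-diffusion process with unit noise and an unbiased starting point, simulated with Euler steps; the function name `simulate_ddm` and all parameter values are hypothetical and are not the fitted HDDM estimates. Under these assumptions, lowering the boundary α shortens RTs and reduces accuracy, whereas lowering non-decision time τ shifts RTs without changing any choice.

```python
import numpy as np

def simulate_ddm(drift, boundary, ndt, n_trials=2000, dt=0.002, max_t=5.0, seed=0):
    """Simulate a symmetric DDM; returns (accuracy, mean RT in seconds).

    Evidence starts at boundary / 2 and diffuses between 0 and `boundary`
    with unit noise. Hitting the upper boundary counts as a correct choice.
    Illustrative sketch only, not the hierarchical model from the paper.
    """
    rng = np.random.default_rng(seed)
    n_steps = int(max_t / dt)
    x = np.full(n_trials, boundary / 2.0)        # unbiased starting point
    decision_time = np.full(n_trials, np.nan)    # NaN = no boundary reached
    correct = np.zeros(n_trials, dtype=bool)
    active = np.ones(n_trials, dtype=bool)
    for step in range(1, n_steps + 1):
        # Noise is drawn for every trial so the random stream does not
        # depend on which trials have already terminated.
        noise = rng.standard_normal(n_trials) * np.sqrt(dt)
        x[active] += drift * dt + noise[active]
        hit_upper = active & (x >= boundary)
        hit_lower = active & (x <= 0.0)
        decision_time[hit_upper | hit_lower] = step * dt
        correct[hit_upper] = True
        active &= ~(hit_upper | hit_lower)
        if not active.any():
            break
    finished = ~np.isnan(decision_time)
    # Observed RT = decision time + non-decision time.
    return correct[finished].mean(), (decision_time[finished] + ndt).mean()

# Hypothetical parameter values chosen only to show the qualitative pattern.
baseline = simulate_ddm(drift=1.0, boundary=2.0, ndt=0.4)
lower_boundary = simulate_ddm(drift=1.0, boundary=1.2, ndt=0.4)  # PDM-like change
lower_ndt = simulate_ddm(drift=1.0, boundary=2.0, ndt=0.3)       # VDM-like change
print("baseline      ", baseline)
print("lower boundary", lower_boundary)
print("lower ndt     ", lower_ndt)
```

      Because the same seed is used in every call, the τ-only change leaves each simulated choice identical and moves the mean RT by exactly the change in τ, which mirrors the argument that a pure non-decision-time effect cannot alter choice consistency.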

      Reviewer #3 (Public Review):

      Summary:

      Garcia et al. investigated whether the human left superior frontal sulcus (SFS) is involved in integrating evidence for perceptual and/or value-based decisions. Specifically, they had 20 participants perform two decision-making tasks (with matched stimuli and motor responses) in an fMRI scanner both before and after they received continuous theta burst transcranial magnetic stimulation (TMS) of the left SFS. The stimulation, thought to decrease neural activity in the targeted region, led to reduced accuracy on the perceptual decision task only. The pattern of results across both model-free and model-based (drift diffusion model) behavioural and fMRI analyses suggests that the left SFS plays a critical role in perceptual decisions only, with no equivalent effects found for value-based decisions. The DDM-based analyses revealed that the role of the left SFS in perceptual evidence accumulation is likely to be one of decision boundary setting. Hence the authors conclude that the left SFS plays a domain-specific causal role in the accumulation of evidence for perceptual decisions. These results are likely to be an important addition to the literature regarding the neural correlates of decision-making.

      Strengths:

      The use of TMS strengthens the evidence for the left SFS playing a causal role in the evidence accumulation process. By combining TMS with fMRI and advanced computational modelling of behaviour, the authors go beyond previous correlational studies in the field and provide converging behavioural, computational, and neural evidence of the specific role that the left SFS may play.

      Sophisticated and rigorous analysis approaches are used throughout.

      Weaknesses:

      (3.1) Though the stimuli and motor responses were equalised between the perception and value-based decision tasks, reaction times (according to Figure 1) and potential difficulty (Figure 2) were not matched. Hence, differences in task difficulty might represent an alternative explanation for the effects being specific to the perception task rather than domain-specificity per se.

      We agree that RTs cannot be matched a priori, and we did not intend them to be. Instead, we equated the inputs to the decision process and verified that each task relied exclusively on its task-relevant evidence. As reported in Results—Behaviour: validity of task-relevant pre-requisites (Fig. 1b–c), accuracy and RTs vary monotonically with the appropriate evidence regressor (SD for PDM; VD for VDM), with no effect of the task-irrelevant regressor. This separability check addresses differences in baseline RTs by showing that, for both tasks, behaviour tracks evidence as designed.

      To rule out a generic difficulty account of the TMS effect, we relied on the within-subject differences-in-differences (DID) framework described in Methods (Differences-in-differences). The key Task × TMS interaction compares the pre→post change in PDM with the pre→post change in VDM while controlling for trialwise evidence and RT covariates. Any time-on-task or unspecific difficulty drift shared by both tasks is subtracted out by this contrast. Using this specification, TMS selectively reduced accuracy for PDM but not VDM (Fig. 3a; Supplementary Fig. 2a,c; Supplementary Tables 5–7).
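
      The logic of the Task × TMS contrast can be sketched on cell means. This is a simplified illustration, not the trialwise regression with evidence and RT covariates described in Methods; the function name `did_contrast` and all accuracy values below are hypothetical.

```python
import numpy as np

def did_contrast(pdm_pre, pdm_post, vdm_pre, vdm_post):
    """Differences-in-differences on per-participant accuracies.

    Each argument holds one value per participant. The result is
    (PDM post - pre) - (VDM post - pre), so any pre-to-post shift that is
    shared by both tasks cancels out of the contrast.
    """
    pdm_change = np.asarray(pdm_post, dtype=float) - np.asarray(pdm_pre, dtype=float)
    vdm_change = np.asarray(vdm_post, dtype=float) - np.asarray(vdm_pre, dtype=float)
    return pdm_change - vdm_change

# Hypothetical accuracies for three participants.
pdm_pre = np.array([0.80, 0.78, 0.82])
vdm_pre = np.array([0.75, 0.77, 0.74])
shared_shift = -0.02   # assumed non-specific change (e.g. time on task)
tms_effect = -0.05     # assumed effect that is selective to PDM

pdm_post = pdm_pre + shared_shift + tms_effect
vdm_post = vdm_pre + shared_shift

did = did_contrast(pdm_pre, pdm_post, vdm_pre, vdm_post)
print(did)  # the shared shift drops out; only the selective effect remains
```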

      Finally, the hierarchical DDM (already in the paper) dissociates latent mechanisms. The post-TMS boundary reduction appears only in PDM, whereas VDM shows a change in non-decision time without a decision-relevant parameter change (Fig. 3c; Supplementary Figs. 4–5). If unmatched difficulty were the sole driver, we would expect parallel effects across tasks, which we do not observe.

      (3.2) No within- or between-participants sham/control TMS condition was employed. This would have strengthened the inference that the apparent TMS effects on behavioural and neural measures can truly be attributed to the left SFS stimulation and not to non-specific peripheral stimulation and/or time-on-task effects.

      We agree that a sham/control condition would further strengthen causal attribution and note this as a limitation. In mitigation, our design incorporates several safeguards already reported in the manuscript:

      · Within-subject pre/post with alternating task blocks and DID modelling (Methods) to difference out non-specific time-on-task effects.

      · Task specificity across levels of analysis: behaviour (PDM accuracy reduction only), computational (boundary reduction only in PDM; no drift change), BOLD (reduced left-SFS accumulated-evidence signal for PDM but not VDM; Fig. 4a–c), and functional coupling (SFS–occipital PPI increase during PDM only; Fig. 5).

      · Matched stimuli and motor outputs across tasks, so any peripheral sensations or general arousal effects should have influenced both tasks similarly; they did not.

      Together, these converging task-selective effects reduce the likelihood that the results reflect non-specific stimulation or time-on-task. We will add an explicit statement in the Limitations noting the absence of sham/control and outlining it as a priority for future work.

      (3.3) No a priori power analysis is presented.

      We appreciate this point. Our sample size (n = 20) matched prior causal TMS and combined TMS–fMRI studies using similar paradigms and analyses (e.g., Philiastides et al., 2011; Rahnev et al., 2016; Jackson et al., 2021; van der Plas et al., 2021; Murd et al., 2021), and was chosen a priori on that basis and the practical constraints of cTBS + fMRI. The within-subject DID approach and hierarchical modelling further improve efficiency by leveraging all trials.

      To address the reviewer’s request for transparency, we will (i) state this rationale in Methods → Participants, and (ii) ensure that all primary effects are reported with 95% CIs or posterior probabilities (already provided for the HDDM as $p_{\mathrm{mcmc}}$). We also note that the design was sensitive enough to detect RT changes in both tasks and a selective accuracy change in PDM, arguing against a blanket lack of power as an explanation for null VDM accuracy effects. We will nevertheless flag the absence of a formal prospective power analysis in the Limitations.

      Recommendations for the Authors:

      Reviewer #1 (Recommendations For The Authors):

      Some important elements of the methods are missing. How was the site for targeting the SFS with TMS identified? The methods described how M1 was located but not SFS.

      Thank you for catching this omission. In the revised Methods we explicitly describe how the left SFS target was localized. Briefly, we used each participant’s T1-weighted anatomical scan and frameless neuronavigation to place a 10-mm sphere at the a priori MNI coordinates (x = −24, y = 24, z = 36) derived from prior work (Heekeren et al., 2004; Philiastides et al., 2011). This sphere was transformed to native space for each participant. The coil was positioned tangentially with the handle pointing posterior-lateral, and coil placement was continuously monitored with neuronavigation throughout stimulation. (All of these procedures mirror what we already report for M1 and are now stated for SFS as well.)

      Where to revise the manuscript:

      Methods → Stimulation protocol. After the first sentence naming cTBS, insert: “The left SFS target was localized on each participant’s T1-weighted anatomical image using frameless neuronavigation. A 10-mm radius sphere was centered at the a priori MNI coordinates x = −24, y = 24, z = 36 (Heekeren et al., 2004; Philiastides et al., 2011), then transformed to native space. The MR-compatible figure-of-eight coil was positioned tangentially over the target with the handle oriented posterior-laterally, and its position was tracked and maintained with neuronavigation during stimulation.”
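
      As an illustration of the target definition only, a 10-mm radius sphere around the stated MNI coordinates can be sketched as a binary mask. The 2-mm isotropic grid, its voxel-to-mm affine, and the function name `sphere_mask` are assumptions made for this sketch; the transformation to each participant's native space and the neuronavigation software itself are not shown.

```python
import numpy as np

# Assumed voxel -> mm affine and shape of a 2-mm isotropic MNI152 template grid.
MNI_2MM_AFFINE = np.array([
    [-2.0, 0.0, 0.0,   90.0],
    [ 0.0, 2.0, 0.0, -126.0],
    [ 0.0, 0.0, 2.0,  -72.0],
    [ 0.0, 0.0, 0.0,    1.0],
])
MNI_2MM_SHAPE = (91, 109, 91)

def sphere_mask(center_mm, radius_mm, affine=MNI_2MM_AFFINE, shape=MNI_2MM_SHAPE):
    """Boolean mask of voxels whose centres lie within radius_mm of center_mm."""
    i, j, k = np.meshgrid(*(np.arange(n) for n in shape), indexing="ij")
    # Homogeneous voxel indices, shape (X, Y, Z, 4).
    voxels = np.stack([i, j, k, np.ones_like(i)], axis=-1)
    # Map every voxel index to millimetre coordinates.
    coords_mm = voxels @ affine.T
    dist = np.linalg.norm(
        coords_mm[..., :3] - np.asarray(center_mm, dtype=float), axis=-1
    )
    return dist <= radius_mm

# Left SFS target coordinates as stated in the response.
mask = sphere_mask(center_mm=(-24, 24, 36), radius_mm=10)
print(int(mask.sum()), "voxels in the target sphere")
```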

      It is not clear how participants were instructed that they should perform the value-difference task. Were they told that they should choose based on their original item value ratings or was it left up to them?

      We agree the instruction should be explicit. Participants were told: “In value-based blocks, choose the item you would prefer to eat at the end of the experiment.” They were informed that one VDM trial would be randomly selected for actual consumption, ensuring incentive-compatibility. We did not ask them to recall or follow their earlier ratings; those ratings were used only to construct evidence (value difference) and to define choice consistency offline.

      Where to revise the manuscript:

      Methods → Experimental paradigm.

      Add a sentence to the VDM instruction paragraph:

      “In value-based (LIKE) blocks, participants were instructed to choose the item they would prefer to consume at the end of the experiment; one VDM trial was randomly selected and implemented, making choices incentive-compatible. Prior ratings were used solely to construct value-difference evidence and to score choice consistency; participants were not asked to recall or match their earlier ratings.”

      Line 86 Introduction, some previous studies were conducted on animals. Why it is problematic that the studies were conducted in animals is not stated. I assume the authors mean that we do not know if their findings will translate to the human brain? I think in fairness to those working with animals it might be worth an extra sentence to briefly expand on this point.

      We appreciate this and will clarify that animal work is invaluable for circuit-level causality, but species differences and putative non-homologous areas (e.g., human SFS vs. rodent FOF) limit direct translation. Our point is not that animal studies are problematic, but that establishing causal roles in humans remains necessary.

      Revision:

      Introduction (paragraph discussing prior animal work). Replace the current sentence beginning “However, prior studies were largely correlational” with:

      “Animal studies provide critical causal insights, yet direct translation to humans can be limited by species-specific anatomy and potential non-homologies (e.g., human SFS vs. frontal orienting fields in rodents). Therefore, establishing causal contributions in the human brain remains essential.”

      Line 100-101: "or whether its involvement is peripheral and merely functionally supporting a larger system" - it is not clear what you mean by 'supporting a larger system'

      We meant that observed SFS activity might reflect upstream/downstream support processes (e.g., attentional control or working-memory maintenance) rather than the computation of evidence accumulation itself. We have rephrased to avoid ambiguity.

      Revision:

      Introduction. Replace the phrase with:

      “or whether its observed activity reflects upstream or downstream support processes (e.g., attention or working-memory maintenance) rather than the accumulation computation per se.”

      The authors do have to make certain assumptions about the BOLD patterns that would be expected of an evidence accumulation region. These assumptions are reasonable and have been adopted in several previous neuroimaging studies. Nevertheless, it should be acknowledged that alternative possibilities exist and this is an inevitable limitation of using fMRI to study decision making. For example, if it turns out that participants collapse their boundaries as time elapses, then the assumption that trials with weaker evidence should have larger BOLD responses may not hold - the effect of more prolonged activity could be cancelled out by the lower boundaries. Again, I think this is just a limitation that could be acknowledged in the Discussion, my opinion is that this is the best effort yet to identify choice-relevant regions with fMRI and the authors deserve much credit for their rigorous approach.

      Agreed. We already ground our BOLD regressors in the DDM literature, but acknowledge that alternative mechanisms (e.g., time-dependent boundaries) can alter expected BOLD–evidence relations. We now add a short limitation paragraph stating this explicitly.

      Revision:

      Discussion (limitations paragraph). Add:

      “Our fMRI inferences rest on model-based assumptions linking accumulated evidence to BOLD amplitude. Alternative mechanisms—such as time-dependent (collapsing) boundaries—could attenuate the prediction that weaker-evidence trials yield longer accumulation and larger BOLD signals. While our behavioural and neural results converge under the DDM framework, we acknowledge this as a general limitation of model-based fMRI.”

      Reviewer #2 (Recommendations For The Authors):

      Minor points

      I suggest the proportion of missed trials should be reported.

      Thank you for the suggestion. In our preprocessing we excluded trials with no response within the task’s response window and any trials failing a priori validity checks. Because non-response trials contain neither a choice nor an RT, they are not entered into the DDM fits or the fMRI GLMs and, by design, carry no weight in the reported results. To keep the focus on the data that informed all analyses, we now (i) state the trial-inclusion criteria explicitly and (ii) report the number of analysed (valid) trials per task and run. This conveys the effective sample size contributing to each condition without altering the analysis set.

      Revision:

      Methods → (at the end of “Experimental paradigm”): “Analyses were conducted on valid trials only, defined as trials with a registered response within the task’s response window and passing pre-specified validity checks; trials without a response were excluded and not analysed.”

      Results → “Behaviour: validity of task-relevant pre-requisites” (add one sentence at the end of the first paragraph): “All behavioural and fMRI analyses were performed on valid trials only (see Methods for inclusion criteria).”

      Figure 4 c is very confusing. Is the legend or caption backwards?

      Thanks for flagging. We corrected the Figure 4c caption to match the colouring and contrasts used in the panel (perceptual = blue/green overlays; value-based = orange/red; ‘post–pre’ contrasts explicitly labeled). No data or analyses were changed, just the wording to remove ambiguity.

      Revision:

      Figure 4 caption (panel c sentence). Replace with:

      “(c) Post–pre contrasts for the trialwise accumulated-evidence regressor show reduced left-SFS BOLD during perceptual decisions (green overlay), with a significantly stronger reduction for perceptual vs value-based decisions (blue overlay). No reduction is observed for value-based decisions.”

      Even if not statistically significant it may be of interest to add the results for Value-based decision making on SFS in Supplementary Table 3.

      Done. We now include the SFS small-volume results for VDM (trialwise accumulated-evidence regressor) alongside the PDM values in the same table, with exact peak, cluster size, and statistics.

      Revision:

      Supplementary Table 3 (title):

      “Regions encoding trialwise accumulated evidence (parametric modulation) during perceptual and value-based decisions, including SFS SVC results for both tasks.”

      Model comparisons: please explain how model complexity is accounted for.

      We clarify that model evidence was compared using the Deviance Information Criterion (DIC), which penalizes model fit by an effective number of parameters (pD). Lower DIC indicates better out-of-sample predictive performance after accounting for model complexity.

      Revision:

      Methods → Hierarchical Bayesian neural-DDM (last paragraph). Add:

      “Model comparison used the Deviance Information Criterion (DIC = D̄ + pD), where pD is the effective number of parameters; thus DIC penalizes model complexity. Lower DIC denotes better predictive accuracy after accounting for complexity.”
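
      The complexity penalty can be made concrete with a small sketch that computes DIC from posterior samples. It assumes the standard definition of the effective number of parameters, pD = D̄ − D(θ̄), and uses a toy Gaussian model with one free parameter instead of the neural-DDM; the function names and the simulated data are hypothetical.

```python
import numpy as np

def deviance(y, mu, sigma=1.0):
    """Deviance (-2 * log-likelihood) of data y under Normal(mu, sigma)."""
    y = np.asarray(y, dtype=float)
    return float(
        np.sum((y - mu) ** 2) / sigma**2 + y.size * np.log(2 * np.pi * sigma**2)
    )

def dic_from_samples(y, mu_samples):
    """Return (DIC, pD) given posterior samples of the single parameter mu."""
    d_bar = float(np.mean([deviance(y, mu) for mu in mu_samples]))  # mean deviance
    d_at_mean = deviance(y, float(np.mean(mu_samples)))  # deviance at posterior mean
    p_d = d_bar - d_at_mean                              # effective number of parameters
    return d_bar + p_d, p_d

# Toy data and an approximate posterior for mu (flat prior, known sigma = 1).
rng = np.random.default_rng(1)
y = rng.normal(0.5, 1.0, size=50)
mu_samples = rng.normal(y.mean(), 1.0 / np.sqrt(y.size), size=4000)

dic_value, p_d = dic_from_samples(y, mu_samples)
print(f"DIC = {dic_value:.2f}, pD = {p_d:.2f}")
```

      In this toy model pD comes out close to 1, matching its single free parameter, so a model that adds parameters without improving the mean deviance receives a higher DIC.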

      Reviewer #3 (Recommendations For The Authors):

      The following issues would benefit from clarification in the manuscript:

      - It is stated that "Our sample size is well within acceptable range, similar to that of previous TMS studies." The sample size being similar to previous studies does not mean it is within an acceptable range. Whether the sample size is acceptable or not depends on the expected effect size. It is perfectly possible that the previous studies cited were all underpowered. What implications might the lack of an a priori power analysis have for the interpretation of the results?

      We agree and have revised our wording. We did not conduct an a priori power analysis. Instead, we relied on a within-participant design that typically yields higher sensitivity in TMS–fMRI settings and on convergence across behavioural, computational, and neural measures. We now acknowledge that the absence of formal power calculations limits claims about small effects (particularly for null findings in VDM), and we frame those null results cautiously.

      Revision:

      Discussion (limitations). Add:

      “The within-participant design enhances statistical sensitivity, yet the absence of an a priori power analysis constrains our ability to rule out small effects, particularly for null results in VDM.”

      - I was confused when trying to match the results described in the 'Behaviour: validity of task-relevant pre-requisites' section on page 6 to what is presented in Figure 1. Specifically, Figure 1C is cited 4 times but I believe two of these should be citing Figure 1B?

      Thank you—this was a citation mix-up. The two places that referenced “Fig. 1C” but described accuracy should in fact point to Fig. 1B. We corrected both citations.

      Revision:

      Results → Behaviour: validity… Change the two incorrect “Fig. 1C” references (when describing accuracy) to “Fig. 1B”.

      - Also, where is the 'SD' coefficient of -0.254 (p-value = 0.123) coming from in line 211? I can't match this to the figure.

      This was a typographical error in an earlier draft. The correct coefficients are those shown in the figure and reported elsewhere in the text (evidence-specific effects: for PDM RTs, SD β = −0.057, p < 0.001; for VDM RTs, VD β = −0.016, p = 0.011; non-relevant evidence terms are n.s.). We removed the erroneous value.

      Revision:

      Results → Behaviour: validity… (sentence with −0.254). Delete the incorrect value and retain the evidence-specific coefficients consistent with Fig. 1B–C.

      - It is reported that reaction times were significantly faster for the perceptual relative to the value-based decision task. Was overall accuracy also significantly different between the two tasks? It appears from Figure 3 that it might be, But I couldn't find this reported in the text.

      To avoid conflating task with evidence composition, we did not emphasize between-task accuracy averages. Our primary tests examine evidence-specific effects and TMS-induced changes within task. For completeness, we now report descriptive mean accuracies by task and point readers to the figure panels that display accuracy as a function of evidence (which is the meaningful comparison in our matched-evidence design). We refrain from additional hypothesis testing here to keep the analyses aligned with our preregistered focus.

      Revision:

      Results → Behaviour: validity… Add:

      “For completeness, group-mean accuracies by task are provided descriptively in Fig. 3a; inferential tests in the manuscript focus on evidence-specific effects and TMS-induced changes within task.”

    1. eLife Assessment

      This important study presents a cross-species and cross-disciplinary analysis of cortical folding. The authors use a combination of physical gel models, computational simulations, and morphometric analysis, extending prior work in human brain development to macaques and ferrets. The findings support the hypothesis that mechanical forces driven by differential growth can account for major aspects of gyrification. The evidence presented is overall strong and convincingly supports the central claims; the findings will be of broad interest in developmental neuroscience.

    2. Reviewer #1 (Public review):

      The manuscript by Yin and colleagues addresses a long-standing question in the field of cortical morphogenesis, regarding factors that determine differential cortical folding across species and individuals with cortical malformations. The authors present work based on a computational model of cortical folding evaluated alongside a physical model that makes use of gel swelling to investigate the role of a two-layer model for cortical morphogenesis. The study assesses these models against empirically derived cortical surfaces based on MRI data from ferret, macaque monkey, and human brains.

      The manuscript is clearly written and presented, and the experimental work (physical gel modeling as well as numerical simulations) and analyses (subsequent morphometric evaluations) are conducted at the highest methodological standards. It constitutes an exemplary use of interdisciplinary approaches for addressing the question of cortical morphogenesis by bringing together well-tuned computational modeling with physical gel models. In addition, the comparative approaches used in this paper establish a foundation for broad-ranging future lines of work that investigate the impact of perturbations or abnormalities during cortical development.

      The cross-species approach taken in this study is a major strength of the work. However, correspondence across the two methodologies did not appear to be equally consistent in predicting brain folding across all three species. The results presented in Figure 4 (and Figures S3 & S4) show broad correspondence in shape index and major sulci landmarks across all three species. Nevertheless, the results presented for the human brain lack the same degree of clear correspondence for the gel model results as observed in the macaque and ferret. While this study clearly establishes a strong foundation for comparative cortical anatomy across species and the impact of perturbations on individual morphogenesis, further work that fine-tunes physical modeling of complex morphologies, such as that of the human cortex, may help to further understand the factors that determine cortical functionalization and pathologies.

    3. Reviewer #2 (Public review):

      This manuscript explores the mechanisms underlying cerebral cortical folding using a combination of physical modelling, computational simulations, and geometric morphometrics. The authors extend their prior work on human brain development (Tallinen et al., 2014; 2016) to a comparative framework involving three mammalian species: ferrets (Carnivora), macaques (Old World monkeys), and humans (Hominoidea). By integrating swelling gel experiments with mathematical differential growth models, they simulate sulcification instability and recapitulate key features of brain folding across species. The authors make commendable use of publicly available datasets to construct 3D models of fetal and neonatal brain surfaces: fetal macaque (ref. [26]), newborn ferret (ref. [11]), and fetal human (ref. [22]).

      Using a combination of physical models and numerical simulations, the authors compare the resulting folding morphologies to real brain surfaces using morphometric analysis. Their results show qualitative and quantitative concordance with observed cortical folding patterns, supporting the view that differential tangential growth of the cortex relative to the subcortical substrate is sufficient to account for much of the diversity in cortical folding. This is a very important point in our field, and can be used in the teaching of medical students.

      Brain folding remains a topic of ongoing debate. While some regard it as a critical specialization linked to higher cognitive function, others consider it an epiphenomenon of expansion and constrained geometry. This divergence was evident in discussions during the Strüngmann Forum on cortical development (Silver et al., 2019). Though folding abnormalities are reliable indicators of disrupted neurodevelopmental processes (e.g., neurogenesis, migration), their relationship to functional architecture remains unclear. Recent evidence suggests that the absolute number of neurons varies significantly with position (sulcus versus gyrus), with potential implications for local processing capacity (e.g., https://doi.org/10.1002/cne.25626). The field is thus in need of comparative, mechanistic studies like the present one.

      This paper offers an elegant and timely contribution by combining gel-based morphogenesis, numerical modelling, and morphometric analysis to examine cortical folding across species. The experimental design - constructing two-layer PDMS models from 3D MRI data and immersing them in organic solvents to induce differential swelling - is well-established in prior literature. The authors further complement this with a continuum mechanics model simulating folding as a result of differential growth, as well as a comparative analysis of surface morphologies derived from in vivo, in vitro, and in silico brains.

      Conclusion:

      This is a well-executed and creative study that integrates diverse methodologies to address a longstanding question in developmental neurobiology. While a few aspects, such as regional folding peculiarities, sensitivity to initial conditions, and available human data, could be further elaborated, they do not detract from the overall quality and novelty of the work. I enthusiastically support this paper and believe that it will be of broad interest to the neuroscience, biomechanics, and developmental biology communities.

      [Editor's note: The reviewers were satisfied with the authors' response. The eLife Assessment was slightly updated to reflect the authors' response.]

    4. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      The manuscript by Yin and colleagues addresses a long-standing question in the field of cortical morphogenesis, regarding factors that determine differential cortical folding across species and individuals with cortical malformations. The authors present work based on a computational model of cortical folding evaluated alongside a physical model that makes use of gel swelling to investigate the role of a two-layer model for cortical morphogenesis. The study assesses these models against empirically derived cortical surfaces based on MRI data from ferret, macaque monkey, and human brains.

      The manuscript is clearly written and presented, and the experimental work (physical gel modeling as well as numerical simulations) and analyses (subsequent morphometric evaluations) are conducted at the highest methodological standards. It constitutes an exemplary use of interdisciplinary approaches for addressing the question of cortical morphogenesis by bringing together well-tuned computational modeling with physical gel models. In addition, the comparative approaches used in this paper establish a foundation for broad-ranging future lines of work that investigate the impact of perturbations or abnormalities during cortical development.

      The cross-species approach taken in this study is a major strength of the work. However, correspondence across the two methodologies did not appear to be equally consistent in predicting brain folding across all three species. The results presented in Figure 4 (and Figures S3 and S4) show broad correspondence in shape index and major sulci landmarks across all three species. Nevertheless, the results presented for the human brain lack the same degree of clear correspondence for the gel model results as observed in the macaque and ferret. While this study clearly establishes a strong foundation for comparative cortical anatomy across species and the impact of perturbations on individual morphogenesis, further work that fine-tunes physical modeling of complex morphologies, such as that of the human cortex, may help to further understand the factors that determine cortical functionalization and pathologies.

      We thank the reviewer for the positive assessment and helpful comments. Indeed, the physical gel model of the human brain has a lower similarity index with the real brain. There are several reasons for this.

      First, the highly convoluted human cortex has a few major folds (primary sulci) and a very large number of minor folds associated with secondary or tertiary sulci (on scales of order comparable to the cortical thickness), relative to the ferret and macaque cerebral cortex. In our gel model, the exact shapes, positions, and orientations of these minor folds are stochastic, which makes it hard to have a very high similarity index of the gel models when compared with the brain of a single individual.

      Second, in real human brains, these minor folds evolve dynamically with age and show differences among individuals. In experiments with the gel brain, multiscale folds form and eventually disappear as the swelling progresses through the thickness. Our physical model results are snapshots during this dynamical process, which makes it hard to have a concrete one-to-one correspondence between the instantaneous shapes of the swelling gel and the growing human brain.

      Third, the growth of the brain cortex is inhomogeneous in space and varies with time, whereas, in the gel model, swelling is relatively homogeneous.

      We agree that further systematic work, based on our proposed methods, with more fine-tuned gel geometries and properties, might provide a deeper understanding of the relations among brain geometry, growth-induced folds, and their functionalization and pathologies. Further analysis of cortical pathologies using computational and physical gel models can be found in our companion paper (Choi et al., 2025), also published in eLife:

      G. P. T. Choi, C. Liu, S. Yin, G. Séjourné, R. S. Smith, C. A. Walsh, L. Mahadevan, Biophysical basis for brain folding and misfolding patterns in ferrets and humans. eLife, 14, RP107141, 2025. doi:10.7554/eLife.107141

      Reviewer #2 (Public review):

      This manuscript explores the mechanisms underlying cerebral cortical folding using a combination of physical modelling, computational simulations, and geometric morphometrics. The authors extend their prior work on human brain development (Tallinen et al., 2014; 2016) to a comparative framework involving three mammalian species: ferrets (Carnivora), macaques (Old World monkeys), and humans (Hominoidea). By integrating swelling gel experiments with mathematical differential growth models, they simulate sulcification instability and recapitulate key features of brain folding across species. The authors make commendable use of publicly available datasets to construct 3D models of fetal and neonatal brain surfaces: fetal macaque (ref. [26]), newborn ferret (ref. [11]), and fetal human (ref. [22]).

      Using a combination of physical models and numerical simulations, the authors compare the resulting folding morphologies to real brain surfaces using morphometric analysis. Their results show qualitative and quantitative concordance with observed cortical folding patterns, supporting the view that differential tangential growth of the cortex relative to the subcortical substrate is sufficient to account for much of the diversity in cortical folding. This is a very important point in our field, and can be used in the teaching of medical students.

      Brain folding remains a topic of ongoing debate. While some regard it as a critical specialization linked to higher cognitive function, others consider it an epiphenomenon of expansion and constrained geometry. This divergence was evident in discussions during the Strüngmann Forum on cortical development (Silver et al., 2019). Though folding abnormalities are reliable indicators of disrupted neurodevelopmental processes (e.g., neurogenesis, migration), their relationship to functional architecture remains unclear. Recent evidence suggests that the absolute number of neurons varies significantly with position (sulcus versus gyrus), with potential implications for local processing capacity (e.g., https://doi.org/10.1002/cne.25626). The field is thus in need of comparative, mechanistic studies like the present one.

      This paper offers an elegant and timely contribution by combining gel-based morphogenesis, numerical modelling, and morphometric analysis to examine cortical folding across species. The experimental design - constructing two-layer PDMS models from 3D MRI data and immersing them in organic solvents to induce differential swelling - is well-established in prior literature. The authors further complement this with a continuum mechanics model simulating folding as a result of differential growth, as well as a comparative analysis of surface morphologies derived from in vivo, in vitro, and in silico brains.

      We thank the reviewer for the very positive comments.

      I offer a few suggestions here for clarification and further exploration:

      Major Comments

      (1) Choice of Developmental Stages and Initial Conditions

      The authors should provide a clearer justification for the specific developmental stages chosen (e.g., G85 for macaque, GW23 for human). How sensitive are the resulting folding patterns to the initial surface geometry of the gel models? Given that folding is a nonlinear process, early geometric perturbations may propagate into divergent morphologies. Exploring this sensitivity, either through simulations or reference to prior work, would enhance the robustness of the findings.

      The initial geometry is one of the important factors that decides the final folding pattern. The smooth brain in the early developmental stage shows a broad consistency across individuals, and we expect the main folds to form similarly across species and individuals.

      Generally, we choose the initial geometry when the brain cortex is still relatively smooth. For the human, this corresponds approximately to GW23, as the major folds, such as the Rolandic fissure (central sulcus), arise during this developmental stage. For the macaque brain, we chose developmental stage G85, primarily because of the availability of the dataset corresponding to this time, which is also the least folded stage available.

      We expect that large-scale folding patterns are strongly sensitive to the initial geometry but fine-scale features are not. Since our goal is to explain the large-scale features, we expect sensitivity to the initial shape.

      Below are some references of other researchers that are consistent with this idea. Figure 4 from Wang et al. shows some images of simulations obtained by perturbing the geometry of a sphere to an ellipsoid. We see that the growth-induced folds mostly maintain their width (wavelength), but change their orientations.

      Reference:

      Wang, X., Lefèvre, J., Bohi, A., Harrach, M.A., Dinomais, M. and Rousseau, F., 2021. The influence of biophysical parameters in a biomechanical model of cortical folding patterns. Scientific Reports, 11(1), p.7686.

      Related results from the same group show that slight perturbations of brain geometry likewise tend to change the orientations of these folds but not their width/wavelength (Bohi et al., 2019).

      Reference:

      Bohi, A., Wang, X., Harrach, M., Dinomais, M., Rousseau, F. and Lefèvre, J., 2019, July. Global perturbation of initial geometry in a biomechanical model of cortical morphogenesis. In 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) (pp. 442-445). IEEE.

      Finally, a systematic discussion of the role of perturbations on the initial geometries and physical properties can be seen in our work on understanding a different system, gut morphogenesis (Gill et al., 2024).

      We have added the discussion about geometric sensitivity in the section Methods-Numerical Simulations:

      “Small perturbations on initial geometry would affect minor folds, but the main features of major folds, such as orientations, width, and depth, are expected to be conserved across individuals [49, 50]. For simplicity, we do not perturb the fetal brain geometry obtained from datasets.”

      (2) Parameter Space and Breakdown Points

      The numerical model assumes homogeneous growth profiles and simplifies several aspects of cortical mechanics. Parameters such as cortical thickness, modulus ratios, and growth ratios are described in Table II. It would be informative to discuss the range of parameter values for which the model remains valid, and under what conditions the physical and computational models diverge. This would help delineate the boundaries of the current modelling framework and indicate directions for refinement.

      Exploring the valid parameter space is a key problem. We have tested a series of growth parameters and will state them explicitly in our revision. In the current version, we chose the ones that yield a relatively high similarity index to the animal brains. More generally, folding patterns are largely regulated by geometry as well as physical parameters, such as cortical thickness, modulus ratios, growth ratios, and inhomogeneity. In our previous work on a different system, gut morphogenesis, where similar folding patterns are seen, we have explored these features (Gill et al., 2024).

      Reference:

      Gill, H.K., Yin, S., Nerurkar, N.L., Lawlor, J.C., Lee, C., Huycke, T.R., Mahadevan, L. and Tabin, C.J., 2024. Hox gene activity directs physical forces to differentially shape chick small and large intestinal epithelia. Developmental Cell, 59(21), pp.2834-2849.

      (3) Neglected Regional Features: The Occipital Pole of the Macaque

      One conspicuous omission is the lack of attention to the occipital pole of the macaque, which is known to remain smooth even at later gestational stages and has an unusually high neuronal density (2.5× higher than adjacent cortex). This feature is not reproduced in the gel or numerical models, nor is it discussed. Acknowledging this discrepancy, and speculating on possible developmental or mechanical explanations, would add depth to the comparative analysis. The authors may wish to include this as a limitation or a target for future work.

      Yes, we have added that the omission of the Occipital Pole of the macaque is one of our paper’s limitations. Our main aim in this paper is to explore the formation of large-scale folds, so the smooth region is not discussed. But future work could include this to make the model more complete.

      The main text has been modified in Methods, Numerical simulations:

      “To focus on fold formation, we did not discuss the relatively smooth region, such as the Occipital Pole of the macaque.”

      and also in the caption of Figure 4: “... The occipital pole region of macaque brains remains smooth in real and simulated brains.”

      (4) Spatio-Temporal Growth Rates and Available Human Data

      The authors note that accurate, species-specific spatio-temporal growth data are lacking, limiting the ability to model inhomogeneous cortical expansion. While this may be true for ferret and macaque, there are high-quality datasets available for human fetal development, now extended through ultrasound imaging (e.g., https://doi.org/10.1038/s41586-023-06630-3). Incorporating or at least referencing such data could improve the fidelity of the human model and expand the applicability of the approach to clinical or pathological scenarios.

      We thank the reviewer for pointing out the very useful datasets that exist for the exploration of inhomogeneous growth driven folding patterns. We have referred to this paper to provide suggestions for further work in exploring the role of growth inhomogeneities.

      We have referred to this high-quality dataset in our main text, Discussion:

      “...the effect of inhomogeneous growth needs to be further investigated by incorporating regional growth of the gray and white matter not only in human brains [29, 31] based on public datasets [45], but also in other species.”

      A few works have tried to incorporate inhomogeneous growth in simulating human brain folding by separating the central sulcus area into several lobes (e.g., lobe parcellation method, Wang, PhD Thesis, 2021). Since our goal in this paper is to explain the large-scale features of folding in a minimal setting, we have kept our model simple and show that it is still capable of capturing the main features of folding in a range of mammalian brains.

      Reference:

      Xiaoyu Wang. Modélisation et caractérisation du plissement cortical. Signal and Image Processing. Ecole nationale superieure Mines-Télécom Atlantique, 2021. English. 〈NNT : 2021IMTA0248〉.

      (5) Future Applications: The Inverse Problem and Fossil Brains

      The authors suggest that their morphometric framework could be extended to solve the inverse growth problem, that is, reconstructing fetal geometries from adult brains. This speculative but intriguing direction has implications for evolutionary neuroscience, particularly the interpretation of fossil endocasts. Although beyond the scope of this paper, I encourage the authors to elaborate briefly on how such a framework might be practically implemented and validated.

      For the inverse problem, we could use the following strategies:

      a. Perform systematic simulations using different geometries and physical parameters to obtain the variation in morphologies as a function of parameters.

      b. Use either supervised training or unsupervised training (physics-informed neural networks, PINNs) to learn these characteristic morphologies and classify their dependence on the parameters using neural networks. These can then be trained to determine the possible range of geometrical and physical parameters that yield buckled patterns seen in the systematic simulations.

      c. Reconstruct the 3D surface from fossil endocasts. Using the well-trained neural network, it should be possible to predict the initial shape of the smooth brain cortex, growth profile, and stiffness ratio of the gray and white matter.
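For illustration only, strategies (a) to (c) can be sketched as a lookup-based inversion. This is a minimal sketch under loud assumptions: the neural network is replaced by a plain nearest-neighbour search, and `placeholder_forward` is a hypothetical stand-in for a folding simulation whose formulas carry no physical validity. All function and parameter names are invented for this example.

```python
import math
from itertools import product


def placeholder_forward(growth_ratio, thickness):
    """Hypothetical stand-in for a folding simulation.

    Returns a (fold_depth, fold_wavelength) descriptor. The formulas are
    arbitrary placeholders chosen only to be invertible; they are NOT a
    physical model of cortical folding.
    """
    depth = thickness * (growth_ratio - 1.0)
    wavelength = 4.0 * thickness
    return (depth, wavelength)


def build_library(forward, growth_ratios, thicknesses):
    """Step (a): run the forward model over a grid of parameters."""
    return [((g, h), forward(g, h)) for g, h in product(growth_ratios, thicknesses)]


def invert(library, observed):
    """Steps (b)-(c), simplified: return the parameters whose simulated
    descriptor lies closest (Euclidean distance) to the observed one."""
    return min(library, key=lambda entry: math.dist(entry[1], observed))[0]


growth_ratios = [1.5, 2.0, 2.5, 3.0]
thicknesses = [0.02, 0.04, 0.06]
library = build_library(placeholder_forward, growth_ratios, thicknesses)

# Pretend this descriptor was measured on a reconstructed surface.
observed = placeholder_forward(2.5, 0.04)
recovered = invert(library, observed)
```

A learned model would interpolate between library entries rather than snapping to the nearest grid point, which is the main reason to prefer a trained network once the parameter space is large.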

      As an example in this direction, supervised neural networks have been used recently to solve the forward problem to predict the buckling pattern of a growing two-layer system (Chavoshnejad et al., 2023). The inverse problem can then be solved using machine-learning methods when the training datasets are the folded shape, which are then used to predict the initial geometry and physical properties.

      Reference:

      Chavoshnejad, P., Chen, L., Yu, X., Hou, J., Filla, N., Zhu, D., Liu, T., Li, G., Razavi, M.J. and Wang, X., 2023. An integrated finite element method and machine learning algorithm for brain morphology prediction. Cerebral Cortex, 33(15), pp.9354-9366.

      Conclusion

      This is a well-executed and creative study that integrates diverse methodologies to address a longstanding question in developmental neurobiology. While a few aspects, such as regional folding peculiarities, sensitivity to initial conditions, and available human data, could be further elaborated, they do not detract from the overall quality and novelty of the work. I enthusiastically support this paper and believe that it will be of broad interest to the neuroscience, biomechanics, and developmental biology communities.

      Note: The paper mentions a companion paper [reference 11] that explores the cellular and anatomical changes in the ferret cortex. I did not have access to this manuscript, but judging from the title, this paper might further strengthen the conclusions.

      The companion paper (Choi et al., 2025) has also been submitted to eLife and can be found here:

      G. P. T. Choi, C. Liu, S. Yin, G. Séjourné, R. S. Smith, C. A. Walsh, L. Mahadevan, Biophysical basis for brain folding and misfolding patterns in ferrets and humans. eLife, 14, RP107141, 2025. doi:10.7554/eLife.107141

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      This study was conducted and presented to the highest methodological standards. It is clearly written, and the results are thoroughly presented in the main manuscript and supplementary materials. Nevertheless, I would present the following minor points and comments for consideration by the authors prior to finalizing their work:

      We thank the reviewer for positive opinions and helpful comments.

      (1) Where did the MRI-based cortical surface data come from? Specifically, it would be helpful to include more information regarding whether the surfaces were reconstructed based on individual- or group-level data. It appears the surfaces were group-level, and, if so, accounting for individual-level cortical folding may be a fruitful direction for future work.

      The surface data come from public databases, which are stated in the Methods section: “We used a publicly available database for all our 3d reconstructions: fetal macaque brain surfaces are obtained from Liu et al. (2020); newborn ferret brain surfaces are obtained from Choi et al. (2025); and fetal human brain surfaces are obtained from Tallinen et al. (2016).”

      These surfaces are reconstructed based on group-level data. Specifically, the macaque atlas images are constructed for brains at gestational ages of 85 days (G85, N = 18, 9 females), 110 days (G110, N = 10, 7 females) and 135 days (G135, N = 16, 7 females). And yes, future work may focus on individual-level cortical folding, and we expect that more specific results could be found.

      (2) One methodological approach for assessing consistency of cortical folding within species might be an evaluation of cross-hemispheric symmetry. I would find this particularly interesting with respect to the gel models, as it could complement the quantification of variation with respect to the computationally derived and real surfaces.

      Yes, the cross-hemispheric symmetry comparison can be done by our morphometric analysis method. We have added the results of ferret brain’s left-right symmetry for gel models, simulations, and real surfaces in the supplementary material. A typical conformal mapping figure and the similarity index table are shown here.
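As a rough illustration of a left-right comparison based on the shape index and a vector p-norm, the sketch below may help. The shape index follows the standard Koenderink and van Doorn definition (its sign depends on the curvature convention). The normalisation in `similarity_index` is an assumption made for this example and is not taken from the manuscript; the function names are hypothetical. It also assumes both hemispheres have already been brought into vertex-wise correspondence, for instance through the conformal mapping mentioned above.

```python
import math


def shape_index(k1, k2):
    """Koenderink-van Doorn shape index in [-1, 1] from two principal curvatures.

    The sign depends on the curvature sign convention of the surface normals.
    """
    k_max, k_min = max(k1, k2), min(k1, k2)
    if k_max == k_min:
        if k_max == 0:
            return float("nan")  # planar point: shape index is undefined
        return -math.copysign(1.0, k_max)  # umbilic point (limit of the formula)
    return (2.0 / math.pi) * math.atan((k_min + k_max) / (k_min - k_max))


def similarity_index(map_a, map_b, p=2):
    """ASSUMED definition: 1 minus the normalised p-norm of the difference.

    Shape-index values lie in [-1, 1], so the largest possible pointwise
    difference is 2; dividing by 2 maps the score into [0, 1].
    """
    if len(map_a) != len(map_b) or not map_a:
        raise ValueError("maps must be non-empty and of equal length")
    mean_pth = sum(abs(a - b) ** p for a, b in zip(map_a, map_b)) / len(map_a)
    return 1.0 - (mean_pth ** (1.0 / p)) / 2.0


# Toy principal curvatures at three corresponding vertices per hemisphere.
left = [shape_index(k1, k2) for k1, k2 in [(1.0, -1.0), (1.0, 0.0), (2.0, 0.5)]]
right = [shape_index(k1, k2) for k1, k2 in [(1.0, -0.8), (1.0, 0.1), (2.0, 0.5)]]
score = similarity_index(left, right, p=2)
```

With this normalisation, identical maps score 1 and maximally opposite maps score 0; any other normalisation would change the absolute values but not the ranking of comparisons.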

      (3) Was there a specific reason to reorder the histogram plots in Figure 4c to macaque, ferret, human rather than to maintain the order presented in Figure 4a/b of ferret, macaque, human? I appreciate that this is a minor concern, and all subplots are indeed properly titled, but consistent order may improve clarity.

      We have reordered the histogram plots to make all the figure orders consistent.

      Reviewer #2 (Recommendations for the authors):

      (1) Please consider revising the caption of Figure 1 (or equivalent figures) to explicitly state whether features such as the macaque occipital flatness were reproduced or not.

      We thank the reviewer for pointing out the macaque occipital flatness.

      Author response table 1.

      Left-right similarity index evaluated by comparing the shape index of ferret brains, calculated with the vector p-norm, p = 2.

      Author response image 1.

      Left-right similarity index of ferret brains

      The occipital pole of the macaque remains relatively smooth in both real brains and computational models. However, our main aim in this paper is to explore the formation of large-scale folds, so the smooth region is not discussed in depth. Future work could include this to make the model more complete.

      (2) Some figures could benefit from clearer labelling to distinguish between in vivo, in vitro, and in silico results.

      We have supplemented some texts in panels to make the labelling clearer.

      (3) The manuscript would benefit from a short paragraph in the Discussion reflecting on how future incorporation of regional heterogeneities might improve model fidelity.

      We have added a sentence in the Discussion Section about improving the model fidelity by considering regional heterogeneities.

      “Future more accurate models incorporating spatio-temporal inhomogeneous growth profiles and mechanical properties, such as varying stiffness, would make the folding pattern closer to the real cortical folding. This relies on more in vivo measurements of the brain’s physical properties and cortical expansion.”

      (4) Suggestions for improved or additional experiments, data, or analyses.

      (5) Clarify and justify the selection of developmental stages: The authors should explain why particular gestational stages (e.g., G85 for macaque, GW23 for human) were chosen as starting points for the physical and computational models. A discussion of how sensitive the folding patterns are to the initial geometry would help assess the robustness of the model. If feasible, a brief sensitivity analysis, varying initial age or surface geometry, would strengthen the conclusions.

      The initial geometry is one of the important factors that decides the final folding pattern. The smooth brain in the early developmental stage shows a broad consistency across individuals, and we expect the main folds to form similarly across species and individuals.

      Generally, we choose the initial geometry when the brain cortex is still relatively smooth. For the human, this corresponds approximately to GW23, as the major folds, such as the Rolandic fissure (central sulcus), arise during this developmental stage. For the macaque brain, we chose developmental stage G85, primarily because of the availability of the dataset corresponding to this time, which is also the least folded stage available.

      We expect that large-scale folding patterns are strongly sensitive to the initial geometry but fine-scale features are not. Since our goal is to explain the large-scale features, we expect sensitivity to the initial shape.

      We have added the discussion about geometric sensitivity in the section Methods-Numerical Simulations: “Small perturbations on initial geometry would affect minor folds, but the main features of major folds, such as orientations, width, and depth, are expected to be conserved across individuals [49, 50]. For simplicity, we do not perturb the fetal brain geometry obtained from datasets.”

      (6) Explore parameter boundaries more explicitly: The paper would benefit from a clearer account of the ranges of mechanical and geometric parameters (e.g., growth ratios, cortical thickness) for which the model holds. Are there specific conditions under which the physical and numerical models diverge? Identifying breakdown points would help readers understand the model’s limitations and applicability.

      Exploring the valid parameter space is a key problem. We have tested a series of growth parameters and will state them explicitly in our revision. In the current version, we chose the ones that yield a relatively high similarity index to the animal brains. More generally, folding patterns are largely regulated by geometry as well as physical parameters, such as cortical thickness, modulus ratios, and growth ratios and inhomogeneity. In our previous work on a different system, gut morphogenesis, where similar folding patterns are seen, we have explored these features (Gill et al., 2024).

      (7) Address species-specific cortical peculiarities: A striking omission is the flat occipital pole of the macaque, which is not reproduced in the physical or computational models. Given its known anatomical and cellular distinctiveness, this discrepancy warrants discussion. Even if not explored experimentally, the authors could speculate on what developmental or mechanical conditions would be needed to reproduce such regional smoothness.

      Please refer to our answer to public reviewer 2, question (3). From our results, the formation of a smooth occipital pole might indicate that the spatio-temporal growth rates of gray and white matter are consistent in this region, such that there is not much differential growth.

      (8) Consider integration of available human growth data: While the authors note the lack of spatiotemporal growth data across species, such datasets exist for human fetal brain development, including those from MRI and ultrasound studies (e.g., Nature 2023). Incorporating these into the human model, or at least discussing their implications, would enhance biological relevance.

      Yes, some datasets for fetal human brains have provided very comprehensive measurements on brain shapes at many developmental stages. This can surely be implemented in our current model by calculating the spatio-temporal growth rate from regional cortical shapes at different stages.

      (9) Recommendations for improving the writing and presentation:

      a) The manuscript is generally well-written, but certain sections would benefit from more explicit links between the biological phenomena and the modeling framework. For instance, the Introduction and Discussion could more clearly articulate how mechanical principles interface with genetic or cellular processes, especially in the context of evolution and developmental variation.

      We have briefly discussed the gene-regulated cellular process and the induced changes of mechanical properties and growth rules in SI, table S1. In the main text, to be clearer, we have added a sentence:

      “Many malformations are related to gene-regulated abnormal cellular processes and mechanical properties, which are discussed in SI”

      b) The Discussion could better acknowledge limitations and future directions, including regional differences in folding, inter-individual variability, and the model’s assumptions of homogeneous material properties and growth.

      In the discussion section, we have pointed out four main limitations and open directions based on our current model, including the discussion on spatiotemporal growth and property. To be more complete, we have supplemented other limitations on the regional differences in folding and the interindividual variability. In the main text, we added the following sentence:

      “In addition to the homogeneity assumption, we have not investigated the inter-individual variability and regional differences in folding. More accurate and specific work is expected to focus on these directions.”

      c) The authors briefly mention the potential for addressing the inverse growth problem. Expanding this idea in a short paragraph, perhaps with hypothetical applications to fossil brain reconstructions, would broaden the paper’s appeal to evolutionary neuroscientists.

      We have stated general steps in the response to public reviewer 2, question (5).

      (10) Minor corrections to the text and figures:

      a) Figures:

      Label figures more clearly to distinguish between in vivo, in vitro, and in silico brain representations.

      Ensure that the occipital pole of the macaque is visible or annotated, especially if it lacks the expected smoothness.

      Add scale bars where missing for clarity in morphometric comparisons.

      We thank the reviewer for suggestions to improve the readability of our manuscript.

      The in vivo (real), in vitro (gel), and in silico (simulated) results are distinguished both by their labels and by their color schemes: gray-white for the real brain, pink-white for the gel model, and blue-white for the simulations, respectively.

      The occipital pole of the macaque brain remains relatively smooth in our computational model but not in our physical gel model. We have clarified this in the main text: “To focus on fold formation, we did not discuss the relatively smooth region, such as the Occipital Pole of the macaque.”

      All the brain models are rescaled to the same size, where the distance between the anterior-most point of the frontal lobe and the posterior-most point of the occipital lobe is two units.
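The rescaling described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the anterior-posterior axis is aligned with one coordinate axis (index 1 here) and that the extreme points along that axis are the frontal and occipital poles. Centring on the centroid is an extra convenience and is also an assumption. The function name is hypothetical.

```python
def rescale_to_two_units(vertices, axis=1):
    """Centre a mesh and scale it uniformly so its extent along `axis` is 2.

    `vertices` is a sequence of (x, y, z) tuples. `axis` is the ASSUMED
    anterior-posterior direction.
    """
    coords = [v[axis] for v in vertices]
    lo, hi = min(coords), max(coords)
    if hi == lo:
        raise ValueError("mesh has zero extent along the chosen axis")
    scale = 2.0 / (hi - lo)
    n = len(vertices)
    centroid = [sum(v[i] for v in vertices) / n for i in range(3)]
    # Uniform scaling preserves shape; only the overall size changes.
    return [tuple((v[i] - centroid[i]) * scale for i in range(3)) for v in vertices]


toy_mesh = [(0.0, 0.0, 0.0), (1.0, 10.0, 2.0), (3.0, 5.0, 1.0)]
rescaled = rescale_to_two_units(toy_mesh)
extent = max(v[1] for v in rescaled) - min(v[1] for v in rescaled)
```

Because the scaling is uniform, curvature-based measures such as the shape index are unaffected, which is what makes size-normalised comparisons across species meaningful.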

      b) Text:

      Consider revising figure captions to explicitly mention whether specific regional features (e.g., flat occipital pole) were observed or absent in models.

      In Table II (and relevant text), ensure parameter definitions are consistent and explained clearly for a cross-disciplinary audience.

      Add citations to recent human fetal growth imaging work (e.g., ultrasound-based studies) to support claims about available data.

      We have added some descriptions of the characteristics of the folding pattern in the caption of Figure 4, including major folds and smooth regions.

      “Three or four major folds of each brain model are highlighted and served as landmarks. The occipital pole region of macaque brains remains smooth in real and simulated brains.”

      We have clarified the definition of growth ratio g<sub>g</sub>/g<sub>w</sub> and stiffness ratio µ<sub>g</sub>/µ<sub>w</sub> between gray matter and white matter, and the normalized cortical thickness h/L in Table 2.

      We have referred to a high-quality dataset of fetal brain imaging work, the ultrasound-imaging method (Namburete et al., 2023), in our main text, Discussion:

      “...the effect of inhomogeneous growth needs to be further investigated by incorporating regional growth of the gray and white matter not only in human brains [29, 31] based on public datasets [45], but also in other species.”

    1. eLife Assessment

      This important study provides insights into the neurodevelopmental trajectories of structural and functional connectivity gradients in the human brain and their potential associations with behaviour and psychopathology. The evidence supporting the findings is solid. This study will be of interest to neuroscientists interested in understanding functional connectivity across development.

    2. Reviewer #2 (Public review):

      Summary:

      This study aims to show how structural and functional brain organization develops during childhood and adolescence using two large neuroimaging datasets. It addresses whether core principles of brain organization are stable across development, how they change over time, and how these changes relate to cognition and psychopathology. The study finds that brain organization is established early and remains stable but undergoes gradual refinement, particularly in higher-order networks. Structural-functional coupling is linked to better working memory but shows no clear relationship with psychopathology.

      Comments on revisions:

      Follow-up: I would like to thank the authors for their thoughtful and comprehensive revisions. The additional analyses addressing developmental differences in structure-function coupling between CALM and NKI are valuable and clearly strengthen the manuscript. I particularly appreciate the inclusion of the neurotypical subgroup within CALM to disentangle neurotypicality from potential site-related effects, as well as the expanded discussion of these findings in the context of individual variability and equifinality.

      Regarding my earlier comment on the use of COMBAT, I realize that "exclusion" may have been a poor choice of wording. What I meant was that harmonization procedures like COMBAT can, in some cases, weaken extremes or reduce variability by shrinking values toward the mean, rather than literally excluding participants from the analysis. Nevertheless, I appreciate the authors' careful consideration of this point and their additional analysis examining sample coverage following motion-based exclusions.

      Overall, I am satisfied with the revisions, and I believe the manuscript has been substantially improved.

    3. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Lack of Sensitivity Analyses for some Key Methodological Decisions: Certain methodological choices in this manuscript diverge from approaches used in previous works. In these cases, I recommend the following: (i) The authors could provide a clear and detailed justification for these deviations from established methods, and (ii) supplementary sensitivity analyses could be included to ensure the robustness of the findings, demonstrating that the results are not driven primarily by these methodological changes. Below, I outline the main areas where such evaluations are needed:

      This detailed guidance is incredibly valuable, and we are grateful. Work of this kind is in its relative infancy, and there are so many design choices depending on the data available, questions being addressed, and so on. Help in navigating these has been extremely useful. In our revised manuscript we are very happy to add additional justification for design choices made, and wherever possible test the impact of those choices. It is certainly the case that different approaches have been used across the handful of papers published in this space, and, unlike in other areas of systems neuroscience, we have yet to reach the point where any of these approaches are established. We agree with the reviewer that wherever possible these design choices should be tested.

      Use of Communicability Matrices for Structural Connectivity Gradients: The authors chose to construct structural connectivity gradients using communicability matrices, arguing that diffusion map embedding "requires a smooth, fully connected matrix." However, by definition, the creation of the affinity matrix already involves smoothing and ensures full connectedness. I recommend that the authors include an analysis of what happens when the communicability matrix step is omitted. This sensitivity test is crucial, as it would help determine whether the main findings hold under a simpler construction of the affinity matrix. If the results significantly change, it could indicate that the observations are sensitive to this design choice, thereby raising concerns about the robustness of the conclusions. Additionally, if the concern is related to the large range of weights in the raw structural connectivity (SC) matrix, a more conventional approach is to apply a log-transformation to the SC weights (e.g., log(1+𝑆𝐶<sub>𝑖𝑗</sub>)), which may yield a more reliable affinity matrix without the need for communicability measures.

      The reason we used communicability is indeed partly because we wanted to guarantee a smooth fully connected matrix, but also because our end goal for this project was to explore structure-function coupling in these low-dimensional manifolds.  Structural communicability – like standard metrics of functional connectivity – includes both direct and indirect pathways, whereas streamline counts only capture direct communication. In essence we wanted to capture not only how information might be routed from one location to another, but also the more likely situation in which information propagates through the system. 

      In the revised manuscript we have given a clearer justification for why we wanted to use communicability as our structural measure (Page 4, Line 179):

      “To capture both direct and indirect paths of connectivity and communication, we generated weighted communicability matrices using SIFT2-weighted fibre bundle capacity (FBC). These communicability matrices reflect a graph theory measure of information transfer previously shown to maximally predict functional connectivity (Esfahlani et al., 2022; Seguin et al., 2022). This also foreshadowed our structure-function coupling analyses, whereby network communication models have been shown to increase coupling strength relative to streamline counts (Seguin et al., 2020)”.

      We have also referred the reader to a new section of the Results that includes the structural gradients based on the streamline counts (Page 7, line 316):

      “Finally, as a sensitivity analysis, to determine the effect of communicability on the gradients, we derived affinity matrices for both datasets using a simpler measure: the log of raw streamline counts. The first 3 components derived from streamline counts compared to communicability were highly consistent across both NKI  (r<sub>s</sub> = 0.791, r<sub>s</sub> = 0.866, r<sub>s</sub> = 0.761) and the referred subset of CALM (r<sub>s</sub> = 0.951, r<sub>s</sub> = 0.809, r<sub>s</sub> = 0.861), suggesting that in practice the organisational gradients are highly similar regardless of the SC metric used to construct the affinity matrices”. 

      Methodological ambiguity/lack of clarity in the description of certain evaluation steps: Some aspects of the manuscript’s methodological description are ambiguous, making it challenging for future readers to fully reproduce the analyses based on the information provided. I believe the following sections would benefit from additional detail and clarification:

      Computation of Manifold Eccentricity: The description of how eccentricity was computed (both in the results and methods sections) is unclear and may be problematic. The main ambiguity lies in how the group manifold origin was defined or computed. (1) In the results section, it appears that separate manifold origins were calculated for the NKI and CALM groups, suggesting a dataset-specific approach. (2) Conversely, the methods section implies that a single manifold origin was obtained by somehow combining the group origins across the three datasets, which seems contradictory. Moreover, including neurodivergent individuals in defining the central group manifold origin is conceptually problematic. Given that neurodivergent participants might exhibit atypical brain organization, as suggested by Figure 1, this inclusion could skew the definition of what should represent a typical or normative brain manifold. A more appropriate approach might involve constructing the group manifold origin using only the neurotypical participants from both the NKI and CALM datasets. Given the reported similarity between group-level manifolds of neurotypical individuals in CALM and NKI, it would be reasonable to expect that this combined origin should be close to the origin computed within neurotypical samples of either NKI or CALM. As a sanity check, I recommend reporting the distance of the combined neurotypical manifold origin to the centres of the neurotypical manifolds in each dataset. Moreover, if the manifold origin was constructed while utilizing all samples (including neurodivergent samples), I think this needs to be reconsidered. 

      This is a great point, and we are very happy to clarify. Separate manifolds were calculated for the NKI and CALM participants, hence a dataset-specific approach. Indeed, in the long-run our goal was to explore individual differences in these manifolds, relative to the respective group-level origins, and their intersection across modalities, so manifold eccentricity was calculated at an individual level for subsequent analyses. At the group level, for each modality, we computed 3 manifold origins: one for NKI, one for the referred subset of CALM, and another for the neurotypical portion of CALM. Crucially, because the manifolds are always normal, in each case the manifold origin point is near-zero (extremely near-zero, to the 6<sup>th</sup> or 7<sup>th</sup> decimal place). In other words, we do indeed calculate the origin separately each time we calculate the gradients, but the origin is zero in every case. As a result, differences in the origin point cannot be the source of any differences we observe in manifold eccentricity between groups or individuals. We have updated the Methods section with the manifold origin points for each dataset and clarified our rationale (Page 16, Line 1296):

      “Note that we used a dataset-specific approach when we computed manifold eccentricity for each of the three groups relative to their group-level origin: neurotypical CALM (SC origin = -7.698 x 10<sup>-7</sup>, FC origin = 6.724 x 10<sup>-7</sup>), neurodivergent CALM (SC origin = -6.422 x 10 , FC origin = 1.363 x 10 ), and NKI (SC origin = -7.434 x 10 , FC origin = 4.308 x 10<sup>-6</sup>). Eccentricity is a relative measure and thus normalised relative to the origin. Because of this normalisation, each time gradients are constructed the manifold origin is necessarily near-zero, meaning that differences in manifold eccentricity of individual nodes, either between groups or individuals, stem from the eccentricity of that node rather than a difference in origin point”. 

      We clarified the computation of the respective manifold origins within the Results section, and referred the reader to the relevant Methods section (Page 9, line 446):

      “For each modality (2 levels: SC and FC) and dataset (3 levels: neurotypical CALM, neurodivergent CALM, and NKI), we computed the group manifold origin as the mean of their respective first three gradients. Because of the normal nature of the manifolds this necessarily means that these origin points will be very near-zero, but we include the exact values in the ‘Manifold Eccentricity’ methodology sub-section”. 
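
      The origin and eccentricity computation described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: it assumes eccentricity is the Euclidean distance of each node from the origin in the space of the first three gradients, and the function names and toy coordinates are hypothetical.

```python
import math

def manifold_origin(coords):
    """Mean position of all nodes along each gradient (the group manifold origin)."""
    n_nodes = len(coords)
    n_dims = len(coords[0])
    return [sum(node[d] for node in coords) / n_nodes for d in range(n_dims)]

def eccentricity(coords, origin):
    """Euclidean distance of each node from the origin in gradient space (assumed definition)."""
    return [math.dist(node, origin) for node in coords]

# Hypothetical toy manifold: 4 nodes x 3 gradients, already mean-centred,
# so the origin falls at zero as described in the response.
toy = [
    [0.2, -0.1, 0.0],
    [-0.2, 0.1, 0.0],
    [0.0, 0.3, -0.4],
    [0.0, -0.3, 0.4],
]
origin = manifold_origin(toy)
ecc = eccentricity(toy, origin)
```

      Because the toy gradients are mean-centred, the origin is zero and any difference in eccentricity between nodes reflects node position alone, which is the argument made in the response.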

      Individual-Level Gradients vs. Group-Level Gradients: Unlike previous studies that examined alterations in principal gradients (e.g., Xia et al., 2022; Dong et al., 2021), this manuscript focuses on gradients derived directly from individual-level data. In contrast, earlier works have typically computed gradients based on grouped data, such as using a moving window of individuals based on age (Xia et al.) or evaluating two distinct age groups (Dong et al.). I believe it is essential to assess the sensitivity of the findings to this methodological choice. Such an evaluation could clarify whether the observed discrepancies with previous reports are due to true biological differences or simply a result of different analytical strategies.

      This is a brilliant point. The central purpose of our project was to test how individual differences in these gradients, and their intersection across modalities, related to differences in phenotype (e.g. cognitive difficulties). This necessitated calculating gradients at the level of individuals and building a pipeline to do so, given that we could find no other examples. Nonetheless, despite this different goal and thus approach, we had expected to replicate a couple of other key findings, most prominently the ‘swapping’ of gradients shown by Dong et al. (2021). We were also surprised that we did not find this change in order. The reviewer is right that there could be several design features that produce the difference, and in the revised manuscript we test several of them. We have added the following text to the manuscript as a sensitivity analysis for the Results sub-section titled “Stability of individual-level gradients across developmental time” (Page 7, Line 344 onwards):

      “One possibility is that our observation of gradient stability – rather than a swapping of the order for the first two gradients (Dong et al., 2021) – is because we calculated them at an individual level. To test this, we created subgroups and contrasted the first two group-level structural and functional gradients derived from children (younger than 12 years old) versus those from adolescents (12 years old and above), using the same age groupings as prior work (Dong et al., 2021). If our use of individually calculated gradients produces the stability, then we should observe the swapping of gradients in this sensitivity analysis. Using baseline scans from NKI, the primary structural gradient in childhood (N = 99), shown in Figure 1f, was highly correlated (r<sub>s</sub> = 0.995) with that derived from adolescents (N = 123). Likewise, the secondary structural gradient in childhood was highly consistent in adolescence (r<sub>s</sub> = 0.988). In terms of functional connectivity, the principal gradient in childhood (N = 88) was highly consistent in adolescence (r<sub>s</sub> = 0.990, N = 125). The secondary gradient in childhood was again highly similar in adolescence (r<sub>s</sub> = 0.984). The same result occurred in the CALM dataset: In the baseline referred subset of CALM, the primary and secondary communicability gradients derived from children (N = 258) and adolescents (N = 53) were near-identical (r<sub>s</sub> = 0.991 and r<sub>s</sub> = 0.967, respectively). The primary and secondary functional gradients derived from children (N = 130) and adolescents (N = 43) were also near-identical (r<sub>s</sub> = 0.972 and r<sub>s</sub> = 0.983, respectively). These consistencies across development suggest that gradients of communicability and functional connectivity established in childhood are the same as those in adolescence, irrespective of group-level or individual-level analysis. 
Put simply, our failure to replicate the swapping of gradient order in Dong et al. (2021) is not the result of calculating gradients at the level of individual participants.”

      Procrustes Transformation: It is unclear why the authors opted to include a Procrustes transformation in this analysis, especially given that previous related studies (e.g., Dong et al.) did not apply this step. I believe it is crucial to evaluate whether this methodological choice influences the results, particularly in the context of developmental changes in organizational gradients. Specifically, the Procrustes transformation may maximize alignment to the group-level gradients, potentially masking individual-level differences. This could result in a reordering of the gradients (e.g., swapping the first and second gradients), which might obscure true developmental alterations. It would be informative to include an analysis showing the impact of performing vs. omitting the Procrustes transformation, as this could help clarify whether the observed effects are robust or an artifact of the alignment procedure. (Please also refer to my comment on adding a subplot to Figure 1). Additionally, clarifying how exactly the transformation was applied to align gradients across hemispheres, individuals, and/or datasets would help resolve ambiguity. 

      The current study investigated individual differences in connectome organisation, rather than group-level trends (Dong et al., 2021). This necessitates aligning individual gradients to the corresponding group-level template using a Procrustes rotation. Without a rotation, there is no way of knowing if you are comparing ‘like with like’: the manifold eccentricity of a given node may appear to change across individuals simply due to subtle differences in the arbitrary orientation of the underlying manifolds. We also note that prior work examining individual differences in principal alignment has used Procrustes (Xia et al., 2022), demonstrating emergence of the principal gradient across development, albeit with much smaller effects than Dong and colleagues (2021). Nonetheless, we agree, the Procrustes rotation could be another source of the differences we observed with the previous paper (Dong et al. 2021). We explored the impact of the Procrustes rotation on individual gradients as our next sensitivity analysis. We recalculated everyone’s gradients without Procrustes rotation. We then tested the alignment of each participant with the group-level gradients using Spearman’s correlations, followed by a series of generalised linear models to predict principal gradient alignment using head motion, age, and sex. The expected swapping of the first and second functional gradient (Dong et al., 2021) would be represented by a decrease in the spatial similarity of each child’s principal functional gradient to the principal childhood group-level gradient, at the onset of adolescence (~age 12). However, there is no age effect on this unrotated alignment, suggesting that the lack of gradient swapping in our data does not appear to be the result of the Procrustes rotation. When you use unrotated individual gradients the alignment is remarkably consistent across childhood and adolescence. Alignment is, however, related to head motion, which is often related to age. 
To emphasise the importance of motion, particularly in relation to development, we conducted a mediation analysis of the relationship between age and principal alignment (without correcting for motion), with motion as a mediator, within the NKI dataset. Before accounting for motion, the relationship between age and principal alignment is significant, but this can be entirely accounted for by motion. In our revised manuscript we have included this additional analysis in the Results sub-section titled “Stability of individual-level gradients across developmental time”, following on from the above point about the effect of group-level versus individual-level analysis (Page 8, Line 400):

      “A second possible source of discrepancy between our results and those of prior work examining developmental change in group-level functional gradients (Dong et al., 2021) was the use of Procrustes alignment. Such alignment of individual-level gradients to group-level templates is a necessary step to ensure valid comparisons between corresponding gradients across individuals, and has been implemented in sliding-window developmental work tracking functional gradient development (Xia et al., 2022). Nonetheless, we tested whether our observation of stable principal functional and communicability gradients may be an artefact of the Procrustes rotation. We did this by modelling how individual-level alignment without Procrustes rotation to the group-level templates varies with age, head motion, and sex, as a series of generalised linear models. We included head motion as the magnitude of the Procrustes rotation has been shown to be positively correlated with mean framewise displacement (Sasse et al., 2024), and prior group-level work (Dong et al., 2021) included an absolute motion threshold rather than continuous motion estimates. Using the baseline referred CALM sample, there was no significant relationship between alignment and age (β = -0.044, 95% CI = [-0.154, 0.066], p = 0.432) after accounting for head motion and sex. Interestingly, however, head motion was significantly associated with alignment (β = -0.318, 95% CI = [-0.428, -0.207], p = 1.731 x 10<sup>-8</sup>), such that greater head motion was linked to weaker alignment. Note that older children tended to exhibit less motion for their structural scans (r<sub>s</sub> = 0.335, p < 0.001). We observed similar trends in functional alignment, whereby tighter alignment was significantly predicted by lower head motion (β = -0.370, 95% CI = [-0.509, -0.231], p = 1.857 x 10<sup>-7</sup>), but not by age (β = 0.049, 95% CI = [-0.090, 0.187], p = 0.490). 
Note that age and head motion for functional scans were not significantly related (r<sub>s</sub> = -0.112, p = 0.137). When repeated for the baseline scans of NKI, alignment with the principal structural gradient was not significantly predicted by either scan age (β = 0.019, 95% CI = [-0.124, 0.163], p = 0.792) or head motion (β = -0.133, 95% CI = [-0.175, 0.009], p = 0.067) together in a single model, where age and motion were negatively correlated (r<sub>s</sub> = -0.355, p < 0.001). Alignment with the principal functional gradient was significantly predicted by head motion (β = -0.183, 95% CI = [-0.329, -0.036], p = 0.014) but not by age (β= 0.066, 95% CI = [-0.081, 0.213], p = 0.377), where age and motion were also negatively correlated (r<sub>s</sub> = -0.412, p < 0.001). Across modalities and datasets, alignment with the principal functional gradient in NKI was the only example in which there was a significant correlation between alignment and age (r<sub>s</sub> = 0.164, p = 0.017) before accounting for head motion and sex. This suggests that apparent developmental effects on alignment are minimal, and where they do exist they are removed after accounting for head motion. Put together this suggests that the lack of order swapping for the first two gradients is not the result of the Procrustes rotation – even without the rotation there is no evidence for swapping”.

      “To emphasise the importance of head motion in the appearance of developmental change in alignment, we examined whether accounting for head motion removes any apparent developmental change within NKI. Specifically, we tested whether head motion mediates the relationship between age and alignment (Figure 1X), controlling for sex, given that higher motion is associated with younger children (β = -0.429, 95% CI = [-0.552, -0.305], p = 7.957 x 10<sup>-11</sup>), and stronger alignment is associated with reduced motion (β = -0.211, 95% CI = [-0.344, -0.078], p = 2.017 x 10<sup>-3</sup>). Motion mediated the relationship between age and alignment (β = 0.078, 95% CI = [0.006, 0.146], p = 1.200 x 10<sup>-2</sup>), accounting for 38.5% of the variance in the age-alignment relationship, such that the link between age and alignment became non-significant after accounting for motion (β = 0.066, 95% CI = [-0.081, 0.214], p = 0.378). This firstly confirms our GLM analyses, where we control for motion and find no age associations. Moreover, this suggests that caution is required when associations between age and gradients are observed. In our analyses, because we calculate individual gradients, we can correct for individual differences in head motion in all our analyses. However, other than using an absolute motion threshold and motion-matched child and adolescent groups, individual differences in motion were not accounted for by prior work which demonstrated a flipping of the principal functional gradients with age (Dong et al., 2021)”. 

      We further clarify the use of Procrustes rotation as a separate sub-section within the Methods (Page 25, Line 1273):

      “Procrustes Rotation

      For group-level analysis, for each hemisphere we constructed an affinity matrix using a normalized angle kernel and applied diffusion-map embedding. The left hemisphere was then aligned to the right using a Procrustes rotation. For individual-level analysis, eigenvectors for the left hemisphere were aligned with the corresponding group-level rotated eigenvectors. No alignment was applied across datasets. The only exception to this was for structural gradients derived from the referred CALM cohort. Specifically, we aligned the principal gradient of the left hemisphere to the secondary gradient of the right hemisphere: this was due to the first and second gradients explaining a very similar amount of variance, and hence their order was switched”. 

      SC-FC Coupling Metric: The approach used to quantify nodal SC-FC coupling in this study appears to deviate from previously established methods in the field. The manuscript describes coupling as the "Spearman-rank correlation between Euclidean distances between each node and all others within structural and functional manifolds," but this description is unclear and lacks sufficient detail. Furthermore, this differs from what is typically referred to as SC-FC coupling in the literature. For instance, the cited study by Park et al. (2022) utilizes a multiple linear regression framework, where communicability, Euclidean distance, and shortest path length are independent variables predicting functional connectivity (FC), with the adjusted R-squared score serving as the coupling index for each node. On the other hand, the Baum et al. (2020) study, also cited, uses Spearman correlation, but between raw structural connectivity (SC) and FC values. If the authors opt to introduce a novel coupling metric, it is essential to demonstrate its similarity to these previous indices. I recommend providing an analysis (supplementary) showing the correlation between their chosen metric and those used in previous studies (e.g., the adjusted R-squared scores from Park et al. or the SC-FC correlation from Baum et al.). Furthermore, if the metrics are not similar and results are sensitive to this alternative metric, it raises concerns about the robustness of the findings. A sensitivity analysis would therefore be helpful (in case the novel coupling metric is not like previous ones) to determine whether the reported effects hold true across different coupling indices.

      This is a great point, and we are happy to take the reviewer’s recommendation. There are multiple different ways of calculating structure-function coupling. For our set of questions, it was important that our metric incorporated information about the structural and functional manifolds, rather than being a separate approach that is unrelated to these low-dimensional embeddings. Put simply, we wanted our coupling measure to be about the manifolds and gradients outlined in the early sections of the results. We note that the multiple linear regression framework was developed by Vázquez-Rodríguez and colleagues (2019), whilst the structure-function coupling computed in manifold space by Park and colleagues (2022) was operationalised as a linear correlation between z-transformed functional connectomes and structural differentiation eigenvectors. To clarify how this coupling was calculated, and to justify why we developed a new coupling method based on manifolds rather than borrow an existing approach from the literature, we have revised the manuscript to make this far clearer for readers (Page 13, line 604):

      “To examine the relationship between each node’s relative position in structural and functional manifold space, we turned our attention to structure-function coupling. Whilst prior work typically computed coupling using raw streamline counts and functional connectivity matrices, either as a correlation (Baum et al., 2020) or through a multiple linear regression framework (Vázquez-Rodríguez et al., 2019), we opted to directly incorporate low-dimensional embeddings within our coupling framework. Specifically, as opposed to correlating row-wise raw functional connectivity with structural connectivity eigenvectors (Park et al., 2022), our metric directly incorporates the relative position of each node in low-dimensional structural and functional manifold spaces. Each node was situated in a low-dimensional 3D space, the axes of which were each participant’s gradients, specific to each modality. For each participant and each node, we computed the Euclidean distance with all other nodes within structural and functional manifolds separately, producing a vector of size 200 x 1 per modality. The nodal coupling coefficient was the Spearman correlation between each node’s Euclidean distance to all other nodes in structural manifold space, and that in functional manifold space. Put simply, a strong nodal coupling coefficient suggests that that node occupies a similar location in structural space, relative to all other nodes, as it does in functional space”. 
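
      The manifold-based coupling metric described above can be sketched in a few lines. This is a hedged illustration, not the authors' code: the helper names are hypothetical, the toy coordinates stand in for a participant's three gradients, and the sketch assumes that a node's distance to itself is excluded from the vectors being correlated.

```python
import math

def ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    return pearson(ranks(x), ranks(y))

def nodal_coupling(sc_coords, fc_coords):
    """Per-node Spearman correlation between distances to all other nodes
    in structural and in functional manifold space."""
    n = len(sc_coords)
    coupling = []
    for i in range(n):
        others = [j for j in range(n) if j != i]  # assumption: self-distance excluded
        d_sc = [math.dist(sc_coords[i], sc_coords[j]) for j in others]
        d_fc = [math.dist(fc_coords[i], fc_coords[j]) for j in others]
        coupling.append(spearman(d_sc, d_fc))
    return coupling

# Hypothetical toy data: 5 nodes in a 3D structural manifold; the functional
# manifold is a uniformly scaled copy, so relative positions are identical.
sc = [[0, 0, 0], [1, 0, 0], [0, 2, 0], [0, 0, 3], [1, 2, 4]]
fc = [[2 * v for v in node] for node in sc]
coupling = nodal_coupling(sc, fc)
```

      In this toy case every node has the same relative position in both spaces, so each coupling coefficient is at its maximum; with real gradients the coefficient would fall as a node's neighbourhood differs between modalities.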

      We also agree with the reviewer’s recommendation to compare this to some of the more standard ways of calculating coupling. We compare our metric with 3 others (Baum et al., 2020; Park et al., 2022; Vázquez-Rodríguez et al., 2019), and find that all metrics capture the core developmental sensorimotor-to-association axis (Sydnor et al., 2021). Interestingly, manifold-based coupling measures captured this axis more strongly than non-manifold measures. We have updated the Results accordingly (Page 14, Line 638):

      “To evaluate our novel coupling metric, we compared its cortical spatial distribution to three others (Baum et al., 2020; Park et al., 2022; Vázquez-Rodríguez et al., 2019), using the group-level thresholded structural and functional connectomes from the referred CALM cohort. As shown in Figure 4c, our novel metric was moderately positively correlated to that of a multi-linear regression framework (r<sub>s</sub> = 0.494, p<sub>spin</sub> = 0.004; Vázquez-Rodríguez et al., 2019) and nodal correlations of streamline counts and functional connectivity (r<sub>s</sub> = 0.470, p<sub>spin</sub> = 0.005; Baum et al., 2020). As expected, our novel metric was strongly positively correlated to the manifold-derived coupling measure (r<sub>s</sub> = 0.661, p<sub>spin</sub> < 0.001; Park et al., 2022), more so than the first (Z(198) = 3.669, p < 0.001) and second measure (Z(198) = 4.012, p < 0.001). Structure-function coupling is thought to be patterned along a sensorimotor-association axis (Sydnor et al., 2021): all four metrics displayed weak-to-moderate alignment (Figure 4c). Interestingly, the manifold-based measures appeared most strongly aligned with the sensorimotor-association axis: the novel metric was more strongly aligned than the multi-linear regression framework (Z(198) = -11.564, p < 0.001) and the raw connectomic nodal correlation approach (Z(198) = -10.724, p < 0.001), but the previously-implemented structural manifold approach was more strongly aligned than the novel metric (Z(198) = -12.242, p < 0.001). This suggests that our novel metric exhibits the expected spatial distribution of structure-function coupling, and the manifold approach more accurately recapitulates the sensorimotor-association axis than approaches based on raw connectomic measures”.

      We also added the following to the legend of Figure 4 on page 15:

      “d. The inset Spearman correlation plot of the 4 coupling measures shows moderate-to-strong correlations (p<sub>spin</sub> < 0.005 for all spatial correlations). The accompanying lollipop plot shows the alignment between the sensorimotor-to-association axis and each of the 4 coupling measures, with the novel measure coloured in light purple (p<sub>spin</sub> < 0.007 for all spatial correlations)”. 

      Prediction vs. Association Analysis: The term “prediction” is used throughout the manuscript to describe what appear to be in-sample association tests. This terminology may be misleading, as prediction generally implies an out-of-sample evaluation where models trained on a subset of data are tested on a separate, unseen dataset. If the goal of the analyses is to assess associations rather than make true predictions, I recommend refraining from the term “prediction” and instead clarifying the nature of the analysis. Alternatively, if prediction is indeed the intended aim (which would be more compelling), I suggest conducting the evaluations using a k-fold cross-validation framework. This would involve training the Generalized Additive Mixed Models (GAMMs) on a portion of the data and testing their predictive accuracy on a held-out sample (i.e. different individuals). Additionally, the current design appears to focus on predicting SC-FC coupling using cognitive or pathological dimensions. This is contrary to the more conventional approach of predicting behavioural or pathological outcomes from brain markers like coupling. Could the authors clarify why this reverse direction of analysis was chosen? Understanding this choice is crucial, as it impacts the interpretation and potential implications of the findings. 

      We have replaced “prediction” with “association” across the manuscript. However, for analyses corresponding to Figure 5, which we believe to be the most compelling, we conducted a stratified 5-fold cross-validation procedure, outlined below, repeated 100 times to account for random variation in the train-test splits. To assess whether prediction accuracy in the test splits was significantly greater than chance, we compared our results to those derived from a null dataset in which cognitive factor 2 scores had been permuted across participants. To account for the time-series element and block design of our data, in that some participants had 2 or more observations, we permuted entire participant blocks of cognitive factor 2 scores, keeping all other variables, including covariates, the same. Included in our manuscript are methodological details and results pertaining to this procedure. Specifically, the following has been added to the Results (Page 16, Line 758):

      “To examine the predictive value of the second cognitive factor for global and network-level structure-function coupling, operationalised as a Spearman rank correlation coefficient, we implemented a stratified 5-fold cross-validation framework, and compared predictive accuracy with that of a null data frame with cognitive factor 2 scores permuted across participant blocks (see ‘GAMM cross-validation’ in the Methods). This procedure was repeated 100 times to account for randomness in the train-test splits, using the same model specification as above. Therefore, for each of the 5 network partitions in which an interaction between the second cognitive factor and age was a significant predictor of structure-function coupling (global, visual, somato-motor, dorsal attention, and default-mode), we conducted a Welch’s independent-sample t-test to compare 500 empirical prediction accuracies with 500 null prediction accuracies. Across all 5 network partitions, predictive accuracy of coupling was significantly higher than that of models trained on permuted cognitive factor 2 scores (all p < 0.001). We observed the largest difference between empirical (M = 0.029, SD = 0.076) and null (M = -0.052, SD = 0.087) prediction accuracy in the somato-motor network [t (980.791) = 15.748, p < 0.001, Cohen’s d = 0.996], and the smallest difference between empirical (M = 0.080, SD = 0.082) and null (M = 0.047, SD = 0.081) prediction accuracy in the dorsal attention network [t (997.720) = 6.378, p < 0.001, Cohen’s d = 0.403]. To compare relative prediction accuracies, we ordered networks by descending mean accuracy and conducted a series of Welch’s independent sample t-tests, followed by FDR correction (Figure 5X). Prediction accuracy was highest in the default-mode network (M = 0.265, SD = 0.085), two-fold that of global coupling (t(992.824) = 25.777, p<sub>FDR</sub> = 5.457 x 10<sup>-112</sup>, Cohen’s d = 1.630, M = 0.131, SD = 0.079). 
Global prediction accuracy was significantly higher than the visual network (t (992.644) = 9.273, p<sub>FDR</sub> = 1.462 x 10<sup>-19</sup>, Cohen’s d = 0.586, M = 0.083, SD = 0.085), but visual prediction accuracy was not significantly higher than within the dorsal attention network (t (997.064) = 0.554, p<sub>FDR</sub> = 0.580, Cohen’s d = 0.035, M = 0.080, SD = 0.082). Finally, prediction accuracy within the dorsal attention network was significantly stronger than that of the somato-motor network [t (991.566) = 10.158, p<sub>FDR</sub> = 7.879 x 10<sup>-23</sup>, Cohen’s d = 0.642, M = 0.029, SD = 0.076]. Together, this suggests that out-of-sample developmental predictive accuracy for structure-function coupling, using the second cognitive factor, is strongest in the higher-order default-mode network, and lowest in the lower-order somatosensory network”. 
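
      For readers who want to check the Welch comparisons quoted above, the statistics can be approximated from the reported means and standard deviations. The sketch below is an illustration under stated assumptions: it uses the rounded summary values for the somato-motor network with 500 accuracies per distribution, and it assumes Cohen's d was computed with the equal-n pooled standard deviation. The function name is hypothetical.

```python
import math

def welch_from_summary(m1, s1, n1, m2, s2, n2):
    """Welch's t, Welch-Satterthwaite degrees of freedom, and Cohen's d
    (pooled SD for equal group sizes) from summary statistics."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    d = (m1 - m2) / math.sqrt((s1 ** 2 + s2 ** 2) / 2)
    return t, df, d

# Somato-motor network: empirical vs null prediction accuracy (rounded values from the text)
t, df, d = welch_from_summary(0.029, 0.076, 500, -0.052, 0.087, 500)
```

      Because the inputs are rounded to three decimal places, the outputs land close to, but not exactly on, the reported t, degrees of freedom, and effect size.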

      We have added a separate section for GAMM cross-validation in the Methods (Page 27, Line 1361):

      GAMM cross-validation

      “We implemented a 5-fold cross-validation procedure, stratified by dataset (2 levels: CALM or NKI). All observations from any given participant were assigned to either the testing or training fold, to prevent data leakage, and the cross-validation procedure was repeated 100 times, to account for randomness in data splits. The outcome was predicted global or network-level structure-function coupling across all test splits, operationalised as the Spearman rank correlation coefficient. To assess whether prediction accuracy exceeded chance, we compared empirical prediction accuracy with that of GAMMs trained and tested on null data in which cognitive factor 2 scores were permuted across subjects. The number of observations formed 3 exchangeability blocks (N = 320 with one observation, N = 105 with two observations, and N = 33 with three observations), whereby scores from a participant with two observations were replaced by scores from another participant with two observations, with participant-level scores kept together, and so on for all numbers of observations. We compared empirical and null prediction accuracies using independent-sample t-tests as, although the same participants were examined, the shuffling meant that the relative ordering of participants within both distributions was not preserved. For parallelisation and better stability when estimating models fit on permuted data, we used the bam function from the mgcv R package (Wood, 2017)”. 
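
      The two bookkeeping steps described above (participant-level fold assignment stratified by dataset, and permutation of scores within exchangeability blocks) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' R implementation: the function names and the dictionary-based data layout are hypothetical, and the GAMM fitting itself is omitted.

```python
import random
from collections import defaultdict

def assign_folds(dataset_by_participant, n_folds, rng):
    """Assign whole participants to folds, balanced within each dataset.

    Every observation of a participant inherits that participant's fold,
    so no participant appears in both training and test data.
    """
    by_dataset = defaultdict(list)
    for pid, dataset in dataset_by_participant.items():
        by_dataset[dataset].append(pid)
    fold_of = {}
    for ids in by_dataset.values():
        ids = ids[:]
        rng.shuffle(ids)
        for i, pid in enumerate(ids):
            fold_of[pid] = i % n_folds   # deal participants round-robin
    return fold_of

def permute_within_blocks(scores_by_participant, rng):
    """Swap score sets between participants with the same number of observations.

    Each participant's scores stay together; only their owner changes.
    """
    blocks = defaultdict(list)
    for pid, scores in scores_by_participant.items():
        blocks[len(scores)].append(pid)
    permuted = {}
    for ids in blocks.values():
        donors = ids[:]
        rng.shuffle(donors)
        for recipient, donor in zip(ids, donors):
            permuted[recipient] = scores_by_participant[donor]
    return permuted

# Toy example: two participants with one observation, two with two
rng = random.Random(0)
toy_scores = {"p1": [0.2], "p2": [-1.1], "p3": [0.5, 0.7], "p4": [1.3, 1.0]}
print(permute_within_blocks(toy_scores, rng))
```

      Repeating the fold assignment with a fresh random state on each of the 100 repetitions would give the repeated splits described in the text.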

      We also added a justification for why we predicted coupling using behaviour or psychopathology, rather than vice versa (Page 27, Line 1349):

      “When using our GAMMs to test for the relationship between cognition and psychopathology and our coupling metrics, we opted to predict structure-function coupling using cognitive or psychopathological dimensions, rather than vice versa, to minimise multiple comparisons. In the current framework, we corrected for 8 multiple comparisons within each domain. This would have increased to 16 multiple comparison corrections for predicting two cognitive dimensions using network-level coupling, and 24 multiple comparison corrections for predicting three psychopathology dimensions. Incorporating multiple networks as predictors within the same regression framework introduces collinearity, whilst the behavioural dimensions were orthogonal: for example, coupling is strongly correlated between the somato-motor and ventral attention networks (r<sub>s</sub> = 0.721), between the default-mode and frontoparietal networks (r<sub>s</sub> = 0.670), and between the dorsal attention and fronto-parietal networks (r<sub>s</sub> = 0.650)”. 

      Finally, we noticed a rounding error in the ages of the data frame containing the structure-function coupling values and the cognitive/psychopathology dimensions. We rectified this and replaced the GAMM results, which largely remained the same. 

      In typical applications of diffusion map embedding, sparsification (e.g., retaining only the top 10% of the strongest connections) is often employed at the vertex-level resolution to ensure computational feasibility. However, since the present study performs the embedding at the level of 200 brain regions (a considerably coarser resolution), this step may not be necessary or justifiable. Specifically, for FC, it might be more appropriate to retain all positive connections rather than applying sparsification, which could inadvertently eliminate valuable information about lower-strength connections. Whereas for SC, as the values are strictly non-negative, retaining all connections should be feasible and would provide a more complete representation of the structural connectivity patterns. Given this, it would be helpful if the authors could clarify why they chose to include sparsification despite the coarser regional resolution, and whether they considered this alternative approach (using all available positive connections for FC and all non-zero values for SC). It would be interesting if the authors could provide their thoughts on whether the decision to run evaluations at the resolution of brain regions could itself impact the functional and structural manifolds, their alteration with age, and/or their stability (in contrast to Dong et al. which tested alterations in high-resolution gradients).

      This is another great point. We could retain all connections, but we usually implement some form of sparsification to reduce noise, particularly in the case of functional connectivity. But we nonetheless agree with the reviewer’s point. We should check what impact this is having on the analysis. In brief, we found minimal effects of thresholding, suggesting that the strongest connections are driving the gradient (Page 7, Line 304):

      “To assess the effect of sparsity on the derived gradients, we examined group-level structural (N = 222) and functional (N = 213) connectomes from the baseline session of NKI. The first three functional connectivity gradients derived using the full connectivity matrix (density = 92%) were highly consistent with those obtained from retaining the strongest 10% of connections in each row (r<sub>1</sub> = 0.999, r<sub>2</sub> = 0.998, r<sub>3</sub> < 0.999, all p < 0.001). Likewise, the first three communicability gradients derived from retaining all streamline counts (density = 83%) were almost identical to those obtained from 10% row-wise thresholding (r<sub>1</sub> = 0.994, r<sub>2</sub> = 0.963, r<sub>3</sub> = 0.955, all p < 0.001). This suggests that the reported gradients are driven by the strongest or most consistent connections within the connectomes, with minimal additional information provided by weaker connections. In terms of functional connectivity, such consistency reinforces past work demonstrating that the sensorimotor-to-association axis, the major axis within the principal functional connectivity gradient, emerges across both the top- and bottom-ranked functional connections (Nenning et al., 2023)”.
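
      For readers unfamiliar with row-wise thresholding, the step being compared here can be sketched as below. This is an illustrative example only: the function names are hypothetical, ties at the cutoff are all retained, and the sketch does not treat the diagonal specially or re-symmetrise the matrix, both of which a real pipeline may handle differently.

```python
def threshold_rows(matrix, keep_fraction=0.10):
    """Keep the strongest `keep_fraction` of entries in each row; zero the rest."""
    thresholded = []
    for row in matrix:
        k = max(1, round(keep_fraction * len(row)))   # entries to retain per row
        cutoff = sorted(row, reverse=True)[k - 1]     # k-th largest value
        thresholded.append([v if v >= cutoff else 0.0 for v in row])
    return thresholded

def density(matrix):
    """Fraction of non-zero entries in the matrix."""
    values = [v for row in matrix for v in row]
    return sum(v != 0 for v in values) / len(values)

# Toy 3 x 5 connectivity matrix, keeping the top 40% of each row
toy = [
    [0.1, 0.9, 0.3, 0.5, 0.2],
    [0.8, 0.1, 0.6, 0.2, 0.4],
    [0.3, 0.7, 0.2, 0.9, 0.1],
]
sparse = threshold_rows(toy, keep_fraction=0.4)
print(sparse, density(sparse))
```

      Gradients derived from the thresholded and unthresholded matrices could then be compared by correlation, as in the quoted analysis.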

      Furthermore, we appreciate the nudge to share our thoughts on whether the difference between vertex versus nodal metrics could be important here, particularly regarding thresholds. To combine this point with R2’s recommendation to expand the Discussion, we have added the following paragraph (Page 19, Line 861): 

      “We consider the role of thresholding, cortical resolution, and head motion as avenues to reconcile the present results with select reports in the literature (Dong et al., 2021; Xia et al., 2022). We would suggest that thresholding has a greater effect on vertex-level data, rather than parcel-level. For example, a recent study revealed that the emergence of principal vertex-level functional connectivity gradients in childhood and adolescence is indeed threshold-dependent (Dong et al., 2024). Specifically, the characteristic unimodal organisation for children and transmodal organisation for adolescents only emerged at the 90% threshold: a 95% threshold produced a unimodal organisation in both groups, whilst an 85% threshold produced a transmodal organisation in both groups. Put simply, the ‘swapping’ of gradient orders only occurs at certain thresholds. Furthermore, our results are not necessarily contradictory to this prior report (Dong et al., 2021): developmental changes in high-resolution gradients may be supported by a stable low-dimensional coarse manifold. Indeed, our decision to use parcellated connectomes was partly driven by recent work which demonstrated that vertex-level functional gradients may be derived using biologically-plausible but random data with sufficient spatial smoothing, whilst this effect is minimal at coarser resolutions (Watson & Andrews, 2023). We observed a gradual increase in the variance of individual connectomes accounted for by the principal functional connectivity gradient in the referred subset of CALM, in line with prior vertex-level work demonstrating a gradual emergence of the sensorimotor-association axis as the principal axis of connectivity (Xia et al., 2022), as opposed to a sudden shift. It is also possible that vertex-level data is more prone to motion artefacts in the context of developmental work. 
Transitioning from vertex-level to parcel-level data involves smoothing over short-range connectivity, thus greater variability in short-range connectivity can be observed in vertex-level data. However, motion artefacts are known to increase short-range connectivity and decrease long-range connectivity, mimicking developmental changes (Satterthwaite et al., 2013). Thus, whilst vertex-level data offers greater spatial resolution in representation of short-range connectivity relative to parcel-level data, it is possible that this may come at the cost of making our estimates of the gradients more prone to motion”.

      Evaluating the consistency of gradients across development: the results shown in Figure 1e are used as evidence suggesting that gradients are consistent across ages. However, I believe additional analyses are required to identify potential sources of the observed inconsistency compared to previous works. The claim that the principal gradient explains a similar degree of variance across ages does not necessarily imply that the spatial structure remains the same. The observed variance explanation is hence not enough to ascertain inconsistency with findings from Dong et al., as the spatial configuration of gradients may still change over time. I suggest the following additional analyses to strengthen this claim. Alignment to group-level gradients: Assess how much of the variance in individual FC matrices is explained by each of the group-level gradients (G1, G2, and G3, for both FC and SC). This analysis could be visualized similarly to Figure 1e, with age on the x-axis and variance explained on the y-axis. If the explained variance varies as a function of age, it may indicate that the gradients are not as consistent as currently suggested. 

      This is another great suggestion. In the additional analyses above (new group-level analyses and unrotated gradient analyses) we rule out a couple of the potential causes of the different developmental trends we observe in our data, namely the stability of the gradients over time. The suggested additional analysis is a great idea, and we have implemented it as follows (Page 8, Line 363):

      “To evaluate the consistency of gradients across development, across baseline participants with functional connectomes from the referred CALM cohort (N = 177), we calculated the proportion of variance in individual-level connectomes accounted for by group-level functional gradients. Specifically, we calculated the proportion of variance in an adjacency matrix A accounted for by the vector v<sub>i</sub> as the fraction of the square of the scalar projection of v<sub>i</sub> onto A, over the Frobenius norm of A. Using a generalised linear model, we then tested whether the proportion of variance explained varies systematically with age, controlling for sex and head motion. The variance in individual-level functional connectomes accounted for by the group-level principal functional gradient gradually increased with development (β = 0.111, 95% CI = [0.022, 0.199], p = 1.452 x 10<sup>-2</sup>, Cohen’s d = 0.367), as shown in Figure 1g, and decreased with higher head motion (β = -10.041, 95% CI = [-12.379, -7.702], p = 3.900 x 10<sup>-17</sup>), with no effect of sex (β = 0.071, 95% CI = [-0.380, 0.523], p = 0.757). We observed no developmental effects on the variance explained by the second (r<sub>s</sub> = 0.112, p = 0.139) or third (r<sub>s</sub> = 0.053, p = 0.482) group-level functional gradient. When repeated with the baseline functional connectivity for NKI (N = 213), we observed no developmental effects (β = 0.097, 95% CI = [-0.035, 0.228], p = 0.150) on the variance explained by the principal functional gradient after accounting for motion (β = -3.376, 95% CI = [-8.281, 1.528], p = 0.177) and sex (β = -0.368, 95% CI = [-1.078, 0.342], p = 0.309). However, we observed a significant correlation between age and variance explained (r<sub>s</sub> = 0.137, p = 0.046) before accounting for head motion and sex. 
We observed no developmental effects on the variance explained by the second functional gradient (r<sub>s</sub> = -0.066, p = 0.338), but a weak negative developmental effect on the variance explained by the third functional gradient (r<sub>s</sub> = -0.189, p = 0.006). Note, however, the magnitude of the variance accounted for by the third functional gradient was very small (all < 1%). When applied to communicability matrices in CALM, the proportion of variance accounted for by the group-level communicability gradient was negligible (all < 1%), precluding analysis of developmental change”. 
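
      One possible reading of the variance-explained measure quoted above is sketched below. It rests on assumptions that the text does not confirm: the gradient vector is normalised to unit length, the "scalar projection" is taken as the quadratic form v<sup>T</sup>Av, and the denominator is the squared Frobenius norm so that the result behaves as a proportion. The function name is hypothetical.

```python
import math

def variance_explained(adjacency, vector):
    """Proportion of the squared Frobenius norm of `adjacency` captured by `vector`.

    Assumptions (not confirmed by the source): unit-normalised vector,
    projection computed as v^T A v, denominator is ||A||_F squared.
    """
    norm = math.sqrt(sum(x * x for x in vector))
    unit = [x / norm for x in vector]
    # A v, then v^T (A v)
    av = [sum(a * u for a, u in zip(row, unit)) for row in adjacency]
    projection = sum(u * w for u, w in zip(unit, av))
    frobenius_sq = sum(a * a for row in adjacency for a in row)
    return projection ** 2 / frobenius_sq

# Toy symmetric matrix whose eigenvectors are the coordinate axes
toy_matrix = [[2.0, 0.0], [0.0, 1.0]]
print(variance_explained(toy_matrix, [1.0, 0.0]))
print(variance_explained(toy_matrix, [0.0, 1.0]))
```

      Under these assumptions, the values for a complete set of eigenvectors of a symmetric matrix sum to one, which is what makes the quantity interpretable as a share of variance.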

      “To further probe the consistency of gradients across development, we examined developmental changes in the standard deviation of gradient values, corresponding to heterogeneity, following prior work examining morphological (He et al., 2025) and functional connectivity gradients (Xia et al., 2022). Using a series of generalised linear models within the baseline referred subset of CALM, correcting for head motion and sex, we found that gradient variation for the principal functional gradient increased across development (β = 0.219, 95% CI = [0.091, 0.347], p = 0.001, Cohen’s d = 0.504), indicating greater heterogeneity (Figure 1h), whilst gradient variation for the principal communicability gradient decreased across development (β = -0.154, 95% CI = [-0.267, -0.040], p = 0.008, Cohen’s d = -0.301), indicating greater homogeneity (Figure 1h). Note, a paired t-test on the 173 common participants demonstrated a significant effect of modality on gradient variability (t(172) = -56.639, p = 3.663 x 10<sup>-113</sup>), such that the mean variability of communicability gradients (M = 0.033, SD = 0.001) was less than half that of functional connectivity (M = 0.076, SD = 0.010). Together, this suggests that principal functional connectivity and communicability gradients are established early in childhood and display age-related refinement, but not replacement”. 

      The Issue of Abstraction and Benefits of the Gradient-Based View: The manuscript interprets the eccentricity findings as reflecting changes along the segregation-integration spectrum. Given this, it is unclear why a more straightforward analysis using established graph-theory metrics of segregation-integration was not pursued instead. Mapping gradients and computing eccentricity adds layers of abstraction and complexity. If similar interpretations can be derived directly from simpler graph metrics, what additional insights does the gradient-based framework offer? While the manuscript argues that this approach provides “a more unifying account of cortical reorganization”, it is not evident why this abstraction is necessary or advantageous over traditional graph metrics. Clarifying these benefits would strengthen the rationale for using this method. 

      This is a great point, and something we spent quite a bit of time considering when designing the analysis. The central goal of our project was to identify gradients of brain organisation across different datasets and modalities and then test how the organisational principles of those modalities align. In other words, how do structural and functional ‘spaces’ intersect, and does this vary across the cortex? That for us was the primary motivation for operationalising organisation as nodal location within a low-dimensional manifold space (Bethlehem et al., 2020; Gale et al., 2022; Park et al., 2021), using a simple composite measure to achieve compression, rather than as a series of graph metrics. The reason we subsequently calculated those graph metrics and tested for their association was simply to help us interpret what eccentricity within that low-dimensional space means. Manifold eccentricity was moderately positively correlated with graph-theory metrics of integration, leaving a substantial portion of variance unaccounted for, but we think that association is nonetheless helpful for readers trying to interpret eccentricity. However, since manifold eccentricity tells us about the relative position of a node in that low-dimensional space, it is also likely capturing elements of multiple graph theory measures. Following the Reviewer’s question, this is something we decided to test. Specifically, using 4 measures of segregation, including two new metrics requested by the Reviewer in a minor point (weighted clustering coefficient and normalized degree centrality), we conducted a dominance analysis (Budescu, 1993) with normalized manifold eccentricity of the group-level referred CALM structural connectome. We also detail the use of gradient measures in developmental contexts, and how they can be complementary to traditional graph theory metrics. 

      We have added the following to the Results section (Page 10, Lines 472 onwards): 

      “To further contextualise manifold eccentricity in terms of integration and segregation beyond simple correlations, we conducted a multivariate dominance analysis (Budescu, 1993) of four graph theory metrics of segregation as predictors of nodal normalized manifold eccentricity within the group-level referred CALM structural and functional connectomes (Figure 2c). A dominance analysis assesses the relative importance of each predictor in a multilinear regression framework by fitting 2<sup>n</sup> – 1 models (where n is the number of predictors) and calculating the relative increase in adjusted R<sup>2</sup> caused by adding each predictor to the model across both main effects and interactions. A multilinear regression model including weighted clustering coefficient, within-module degree Z-score, participation coefficient and normalized degree centrality accounted for 59% of variance in nodal manifold eccentricity in the group-level CALM structural connectome. Within-module degree Z-score was the most important predictor (40.31% dominance), almost twice that of the participation coefficient (24.03% dominance) and normalized degree centrality (24.05% dominance) which made roughly equal contributions. The least important predictor was the weighted clustering coefficient (11.62% dominance). When the same approach was applied for the group-level referred CALM functional connectome, the 4 predictors accounted for 52% variability. However, in contrast to the structural connectome, functional manifold eccentricity seemed to incorporate the same graph theory metrics in different proportions. Normalized degree centrality was the most important predictor (47.41% dominance), followed by within-module degree Z-score (24.27%), and then the participation coefficient (15.57%) and weighted clustering coefficient (12.76%) which made approximately equal contributions. 
Thus, whilst structural manifold eccentricity was dominated most by within-module degree Z-score and least by the weighted clustering coefficient, functional manifold eccentricity was dominated most by normalized degree centrality and least by the weighted clustering coefficient. This suggests that manifold mapping techniques incorporate different aspects of integration dependent on modality. Together, manifold eccentricity acts as a composite measure of segregation, being differentially sensitive to different aspects of segregation, without necessitating a priori specification of graph theory metrics. Further discussion of the value of gradient-based metrics in developmental contexts and as a supplement to traditional graph theory analyses is provided in the ‘Manifold Eccentricity’ methodology sub-section”. 
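
      To make the dominance analysis concrete, the sketch below computes general dominance weights by fitting every non-empty subset of predictors (2<sup>n</sup> – 1 models) and averaging each predictor's incremental contribution. It is a simplified illustration, not the authors' implementation: it uses plain R<sup>2</sup> rather than adjusted R<sup>2</sup>, includes main effects only, and all names and the simulated data are hypothetical.

```python
import random
from itertools import combinations

def r_squared(columns, y):
    """R^2 of an ordinary least squares fit (with intercept) of y on `columns`."""
    if not columns:
        return 0.0
    n = len(y)
    y_mean = sum(y) / n
    yc = [v - y_mean for v in y]
    xc = [[v - sum(col) / n for v in col] for col in columns]
    k = len(xc)
    # Normal equations on centred data: (X'X) b = X'y
    a = [[sum(p * q for p, q in zip(xc[i], xc[j])) for j in range(k)] for i in range(k)]
    xty = [sum(p * q for p, q in zip(xc[i], yc)) for i in range(k)]
    rhs = xty[:]
    for col in range(k):  # Gaussian elimination with partial pivoting
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for row in range(col + 1, k):
            factor = a[row][col] / a[col][col]
            for c in range(col, k):
                a[row][c] -= factor * a[col][c]
            rhs[row] -= factor * rhs[col]
    beta = [0.0] * k
    for row in reversed(range(k)):  # back substitution
        tail = sum(a[row][c] * beta[c] for c in range(row + 1, k))
        beta[row] = (rhs[row] - tail) / a[row][row]
    explained = sum(b * v for b, v in zip(beta, xty))
    return explained / sum(v * v for v in yc)

def general_dominance(predictors, y):
    """Average incremental R^2 of each predictor over all subsets of the others."""
    names = list(predictors)
    cache = {}
    def fit(subset):
        key = frozenset(subset)
        if key not in cache:
            cache[key] = r_squared([predictors[n] for n in sorted(key)], y)
        return cache[key]
    weights = {}
    for name in names:
        others = [n for n in names if n != name]
        size_means = []
        for size in range(len(names)):
            gains = [fit(s + (name,)) - fit(s) for s in combinations(others, size)]
            size_means.append(sum(gains) / len(gains))
        weights[name] = sum(size_means) / len(names)
    return weights

# Simulated example with three predictors of decreasing importance
rng = random.Random(0)
x1, x2, x3 = ([rng.gauss(0, 1) for _ in range(200)] for _ in range(3))
y = [2 * a + b + 0.5 * c + rng.gauss(0, 1) for a, b, c in zip(x1, x2, x3)]
weights = general_dominance({"x1": x1, "x2": x2, "x3": x3}, y)
total = sum(weights.values())
print({name: round(100 * w / total, 2) for name, w in weights.items()})
```

      The weights sum to the R<sup>2</sup> of the full model, so expressing each as a percentage of the total gives dominance shares comparable in form to those reported above.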

      We added further justification to the manifold eccentricity Methods subsection (Page 26, line 1283):

      “Gradient-based measures hold value in developmental contexts, above and beyond traditional graph theory metrics: within a sample of over 600 cognitively-healthy adults aged between 18 and 88 years old, sensitivity of gradient-based within-network functional dispersion to age was stronger and more consistent across networks compared to segregation (Bethlehem et al., 2020). In the context of microstructural profile covariance, modules resolved by Louvain community detection occupied distinct positions across the principal two gradients, suggesting that gradients offer a way to meaningfully order discrete graph theory analyses (Paquola et al., 2019)”. 

      We added the following to the Introduction section outlining the application of gradients as cortex-wide coordinate systems (Page 3, Line 121):

      “Using the gradient-based approach as a compression tool, thus forgoing the need to specify singular graph theory metrics a priori, we operationalised individual variability in low-dimensional manifolds as eccentricity (Gale et al., 2022; Park et al., 2021). Crucially, such gradients appear to be useful predictors of phenotypic variation, exceeding edge-level connectomics. For example, in the case of functional connectivity gradients, their predictive ability for externalizing symptoms and general cognition in neurotypical adults surpassed that of edge-level connectome-based predictive modelling (Hong et al., 2020), suggesting that low-dimensional manifolds may be particularly powerful biomarkers of psychopathology and cognition”. 

      We also added the following to the Discussion section (Page 18, Line 839):

      “By capitalising on manifold eccentricity as a composite measure of segregation across development, we build upon an emerging literature pioneering gradients as a method to establish underlying principles of structural (Paquola et al., 2020; Park et al., 2021) and functional (Dong et al., 2021; Margulies et al., 2016; Xia et al., 2022) brain development without a priori specification of specific graph theory metrics of interest”. 

      It is unclear whether the statistical tests finding significant dataset effects are capturing effects of neurotypical vs. neurodivergent, or simply different scanners/sites. Could the neurotypical portion of CALM also be added to distinguish between these two sources of variability affecting dataset effects (i.e. ideally separating this to the effect of site vs. neurotypicality would better distinguish the effect of neurodivergence).

      At the group level, differences in the gradients between the two cohorts are very minor. Indeed, in the manuscript we describe these gradients as being seemingly ‘universal’. But we agree that we should test whether any simple main effects of ‘dataset’ result from the different site or from the phenotype of the participants. The neurotypical portion of CALM (collected at the same site on the same scanner) helped us show that any minor differences in the gradient alignments are likely due to the site/scanner differences rather than the phenotype of the participants. We took the same approach for testing the simple main effects of dataset on manifold eccentricity. To better parse neurotypicality and site effects at the individual level, we conducted a series of sensitivity analyses. First, in response to the reviewer’s earlier comment, we conducted a series of nodal generalized linear models for communicability and FC gradients derived from neurotypical and neurodivergent portions of CALM, alongside NKI, and tested for an effect of neurotypicality above and beyond scanner. As at the group level, having those additional scans on a ‘comparison’ sample for CALM is very helpful in teasing apart these effects. We find that neurotypicality affects communicability gradient expression to a greater degree than functional connectivity. We visualised these results and added them to Figure 1. Second, we used the same approach but for manifold eccentricity. Again, we demonstrate greater sensitivity of communicability to neurotypicality at a global level, but we cannot pin these effects down to specific networks because the effects do not survive the necessary multiple comparison correction. We have added these analyses to the manuscript (Page 13, Line 583): 

      “Much as with the gradients themselves, we suspected that much of the simple main effect of dataset could reflect the scanner / site, rather than the difference in phenotype. Again, we drew upon the CALM comparison children to help us disentangle these two explanations. As a sensitivity analysis to parse effects of neurotypicality and dataset on manifold eccentricity, we conducted a series of generalized linear models predicting mean global and network-level manifold eccentricity, for each modality. We did this across all the baseline data (i.e. including the neurotypical comparison sample for CALM) using neurotypicality (2 levels: neurodivergent or neurotypical), site (2 levels: CALM or NKI), sex, head motion, and age at scan (Figure 3X). We restricted our analysis to baseline scans to create more equally-balanced groups. In terms of structural manifold eccentricity (N = 313 neurotypical, N = 311 neurodivergent), we observed higher manifold eccentricity in the neurodivergent participants at a global level (β = 0.090, p = 0.019, Cohen’s d = 0.188) but the individual network level effects did not survive the multiple comparison correction necessary for looking across all seven networks, with the default-mode network being the strongest (β = 0.135, p = 0.027, p<sub>FDR</sub> = 0.109, Cohen’s d = 0.177). There was no significant effect of neurodiversity on functional manifold eccentricity (N = 292 neurotypical and N = 177 neurodivergent). This suggests that neurodiversity is significantly associated with structural manifold eccentricity, over and above differences in site, but we cannot distinguish these effects reliably in the functional manifold data”. 

      Third, we removed the Scheirer-Ray-Hare test from the results for two reasons. First, its initial implementation did not account for repeated measures, and therefore non-independence between observations, as the same participants may have contributed both structural and functional data. Second, if we wanted to repeat this analysis in CALM using the referred and control portions, a significant difference in group size existed, which may affect the measures of variability. Specifically, for baseline CALM, 311 referred and 91 control participants contributed SC data, whilst 177 referred and 79 control participants contributed FC data. We believe that the ‘cleanest’ parsing of dataset and site for effects of eccentricity is achieved using the GLMs in Figure 3. 

      We observed no significant effect of neurodivergence on the magnitude of structure-function coupling across development, and have added the following text (Page 14, Line 632):

      “To parse effects of neurotypicality and dataset on structure-function coupling, we conducted a series of generalized linear models predicting mean global and network-level coupling using neurotypicality, site, sex, head motion, and age at scan, at baseline (N = 77 CALM neurotypical, N = 173 CALM neurodivergent, and N = 170 NKI). However, we found no significant effects of neurotypicality on structure-function coupling across development”. 

      Since we demonstrated no significant effects of neurotypicality on structure-function coupling magnitude across development, but found differential dataset-specific effects of age on coupling development, we added the following sentence at the end of the coupling trajectory results sub-section (Page 14, line 664):

      “Together, these effects demonstrate that whilst the magnitude of structure-function coupling appears not to be sensitive to neurodevelopmental phenotype, its development with age is, particularly in higher-order association networks, with developmental change being reduced in the neurodivergent sample”.  

      Figure 1.c: A non-parametric permutation test (e.g. Mann-Whitney U test) could quantitatively identify regions with significant group differences in nodal gradient values, providing additional support for the qualitative findings.

      This is a great idea. To examine the effect of referral status on nodal gradient values, whilst controlling for covariates (head motion and sex), we conducted a series of generalised linear models. We opted for this instead of a Mann-Whitney U test, as the latter tests only for differences in distributions, whilst the t-statistic for referral status from the GLM allows us to specify the magnitude and direction of differences in nodal gradient values between the two groups. Again, we conducted this in CALM (referred vs control), at an individual-level, as downstream analyses suggested a main effect of dataset (which is reflected in the highly-similar group-level referred and control CALM gradients). We have updated the Results section with the following text (Page 6, Line 283):

      “To examine the effect of referral status on participant-level nodal gradient values in CALM, we conducted a series of generalized linear models controlling for head motion, sex and age at scan (Figure 1d). We restricted our analyses to baseline scans to reduce the difference in sample size for the referred (311 communicability and 177 functional gradients, respectively) and control participants (91 communicability and 79 functional gradients, respectively), and to the principal gradients. For communicability, 42 regions showed a significant effect (p < 0.05) of neurodivergence before FDR correction, with 9 post FDR correction. 8 of these 9 regions had negative t-statistics, suggesting a reduced nodal gradient value and representation in the neurodivergent children, encompassing both lower-order somatosensory cortices alongside higher-order fronto-parietal and default-mode networks. The largest reductions were observed within the prefrontal cortices of the default-mode network (t = -3.992, p = 6.600 x 10<sup>-5</sup>, p<sub>FDR</sub> = 0.013, Cohen’s d = -0.476), the left orbitofrontal cortex of the limbic network (t = -3.710, p = 2.070 x 10<sup>-4</sup>, p<sub>FDR</sub> = 0.020, Cohen’s d = -0.442) and right somato-motor cortex (t = -3.612, p = 3.040 x 10<sup>-4</sup>, p<sub>FDR</sub> = 0.020, Cohen’s d = -0.431). The right visual cortex was the only exception, with stronger gradient representation within the neurotypical cohort (t = 3.071, p = 0.002, p<sub>FDR</sub> = 0.048, Cohen’s d = 0.366). For functional connectivity, comparatively fewer regions exhibited a significant effect (p < 0.05) of neurotypicality, with 34 regions prior to FDR correction and 1 post. Significantly stronger gradient representation was observed in neurotypical children within the right precentral ventral division of the default-mode network (t = 3.930, p = 8.500 x 10<sup>-5</sup>, p<sub>FDR</sub> = 0.017, Cohen’s d = 0.532). 
Together, this suggests that the strongest and most robust effects of neurodivergence are observed within gradients of communicability, rather than functional connectivity, where alterations in both affect higher-order associative regions”. 

      In the harmonization methodology, it is mentioned that “if harmonisation was successful, we’d expect any significant effects of scanner type before harmonisation to be non-significant after harmonisation”. However, given that there were no significant effects before harmonization, the results reported do not help in evaluating the quality of harmonization.

      We agree with the Reviewer, and have removed the post-harmonisation GLMs, and instead state that there were no significant effects of scanner type before harmonization. 

      Figure 3: It would be helpful to include a plot showing the GAMM predictions versus real observations of eccentricity (x-axis: predictions, y-axis: actual values). 

      To plot the GAMM-predicted smooth effects of age, which we used for visualisation purposes only, we used the get_predictions function from the itsadug R package. This creates model predictions using the median value of nuisance covariates. Thus, whilst we specified the entire age range, the function automatically used the median of head motion and the default level of sex (male) for each dataset-specific trajectory. Since the gamm4 package separates the fitted model into a gam and a linear mixed effects model (which accounts for participant ID as a random effect), and the get_predictions function only uses the gam, random effects are not modelled in the predicted smooths. Therefore, any discrepancy between the observed and predicted manifold eccentricity values is likely due to sensitivity to default choices of covariates other than age, or to random effects. To prevent Figure 3 becoming overcrowded, we opted not to include the predictions: these were strongly correlated with real structural manifold data, but less so for functional manifold data, especially where significant developmental change was absent.

      The 30mm threshold for filtering short streamlines in tractography is uncommon. What is the rationale for using such a large threshold, given the potential exclusion of many short-range association fibres?

      A minimum length of 30mm was the default for the MRtrix3 reconstruction workflow, and something we have previously used. In a previous project, we systematically varied the minimum fibre length and found that this had minimal impact on network organisation (e.g. Mousley et al. 2025). However, we accept that short-range association fibres may have been excluded and have included this in the Discussion as a methodological limitation, alongside our predictions for how the gradients and structure-function coupling may have been altered had we included such fibres (Page 20, Line 955):

      “A potential methodological limitation in the construction of structural connectomes was the 30mm tract length threshold which, despite being the QSIprep reconstruction default (Cieslak et al., 2021), may have excluded short-range association fibres. This is pertinent as tracts of different lengths exhibit unique distributions across the cortex and functional roles (Bajada et al., 2019): short-range connections occur throughout the cortex but peak within primary areas, including the primary visual, somato-motor, auditory, and para-hippocampal cortices, and are thought to dominate lower-order sensorimotor functional resting-state networks, whilst long-range connections are most abundant in tertiary association areas and are recruited alongside tracts of varying lengths within higher-order functional resting-state networks. Therefore, inclusion of short-range association fibres may have resulted in a relative increase in representation of lower-order primary areas and functional networks. On the other hand, we also note the potential misinterpretation of short-range fibres: they may be unreliably distinguished from null models in which tractography is restricted by cortical gyri only (Bajada et al., 2019). Further, prior (neonatal) work has demonstrated that the order of connectivity of regions and topological fingerprints are consistent across varying streamline thresholds (Mousley et al., 2025), suggesting minimal impact”.

      Given the spatial smoothing of fMRI data (6mm FWHM), it would be beneficial to apply connectome spatial smoothing to structural connectivity measures for consistent spatial smoothness.

      This is an interesting suggestion but, given that we are looking at structural communicability within a parcellated network, we are not sure that it would make any difference. The structural data are already very smooth. Nonetheless, we have added the following text to the Discussion (Page 20, Line 968):

      “Given the spatial smoothing applied to the functional connectivity data, and examining its correspondence to streamline-count connectomes through structure-function coupling, applying the equivalent smoothing to structural connectomes may improve the reliability of inference, and subsequent sensitivity to cognition and psychopathology. Connectome spatial smoothing involves applying a smoothing kernel to the two streamline endpoints, whereby variations in smoothing kernels are selected to optimise the trade-off between subject-level reliability and identifiability, thus increasing the signal-to-noise ratio and the reliability of statistical inferences of brain-behaviour relationships (Mansour et al., 2022). However, we note that such smoothing is more effective for high-resolution connectomes, rather than parcel-level, and so would likely have made only a modest improvement (Mansour et al., 2022)”.

      Why was harmonization performed only within the CALM dataset and not across both CALM and NKI datasets? What was the rationale for this decision?

      We thought about this very carefully. Harmonization aims to remove scanner or site effects, whilst retaining the crucial characteristics of interest. Our capacity to retain those characteristics is entirely dependent on them being *fully* captured by covariates, which are then incorporated into the harmonization process. Even with the best set of measures, the idea that we can fully capture ‘neurodivergence’ and thus preserve it in the harmonisation process is dubious. Indeed, across CALM and NKI there is a limited number of common measures (i.e. not the best set of common measures), and thus we are limited in our ability to fully capture neurodivergence with covariates. So, we worried that if we put these two very different datasets into the harmonisation process, we would essentially eliminate the interesting differences between the datasets. We have added this text to the harmonization section of the Methods (Page 24, Line 1225):

      “Harmonization aims to retain key characteristics of interest whilst removing scanner or site effects. However, the site effects in the current study are confounded with neurodivergence, and it is unlikely that neurodivergence may be captured fully using common covariates across CALM and NKI. Therefore, to preserve variation in neurodivergence, whilst reducing scanner effects, we harmonized within the CALM dataset only”. 

      The exclusion of subcortical areas from connectivity analyses is not justified. 

      This is a good point. We used the Schaefer atlas because we had previously used this to derive both functional and structural connectomes, but we agree that it would have been good to include subcortical areas. We have noted this as a limitation in the Discussion (Page 20, Line 977):

      “A potential limitation of our study was the exclusion of subcortical regions. However, prior work has shed light on the role of subcortical connectivity in structural and functional gradients, respectively, of neurotypical populations of children and adolescents (Park et al., 2021; Xia et al., 2022). For example, in the context of the primary-to-transmodal and sensorimotor-to-visual functional connectivity gradients, the mean gradient scores within subcortical networks were demonstrated to be relatively stable across childhood and adolescence (Xia et al., 2022). In the context of structural connectivity gradients derived from streamline counts, which we demonstrated were highly consistent with those derived from communicability, subcortical structural manifolds weighted by their cortical connectivity were anchored by the caudate and thalamus at one pole, and by the hippocampus and nucleus accumbens at the opposite pole, with significant age-related manifold expansion within the caudate and thalamus (Park et al., 2021)”. 

      In the KNN imputation method, were uniform weights used, or was an inverse distance weighting applied?

      Uniform weights were used, and we have updated the manuscript appropriately.

      The manuscript should clarify from the outset that the reported sample size (N) includes multiple longitudinal observations from the same individuals and does not reflect the number of unique participants.

      We have rectified the Abstract (Page 2, Line 64) and Introduction (Page 3, Line 138):

      “We charted the organisational variability of structural (610 participants, N = 390 with one observation, N = 163 with two observations, and N = 57 with three) and functional (512 participants, N = 340 with one observation, N = 128 with two observations, and N = 44 with three)”.

      The term “structural gradients” is ambiguous in the introduction. Clarify that these gradients were computed from structural and functional connectivity matrices, not from other structural features (e.g. cortical thickness).

      We have clarified this in the Introduction (Page 3, Line 134):

      “Applying diffusion-map embedding as an unsupervised machine-learning technique onto matrices of communicability (from streamline SIFT2-weighted fibre bundle capacity) and functional connectivity, we derived gradients of structural and functional brain organisation in children and adolescents…”

      Page 5: The sentence, “we calculated the normalized angle of each structural and functional connectome to derive symmetric affinity matrices” is unclear and needs clarification.

      We have clarified this within the second paragraph of the Results section (Page 4, Line 185):

      “To capture inter-nodal similarity in connectivity, using a normalised angle kernel, we derived individual symmetric affinity matrices from the left and right hemispheres of each communicability and functional connectivity matrix. Varying kernels capture different but highly-related aspects of inter-nodal similarity, such as correlation coefficients, Gaussian kernels, and cosine similarity. Diffusion-map embedding is then applied on the affinity matrices to derive gradients of cortical organisation”. 
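
      One way to picture the affinity step is a minimal sketch that turns a small connectivity matrix into a symmetric affinity matrix. It assumes the common definition of the normalised angle kernel, 1 - arccos(cosine similarity) / π, which the text does not spell out, and it uses a hypothetical 3-region matrix in which every row has at least one non-zero entry.

```python
import math


def normalized_angle_affinity(conn):
    """Pairwise row similarity: 1 - arccos(cosine similarity) / pi."""
    n = len(conn)
    norms = [math.sqrt(sum(v * v for v in row)) for row in conn]
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            dot = sum(a * b for a, b in zip(conn[i], conn[j]))
            cos_sim = dot / (norms[i] * norms[j])
            # Clamp to [-1, 1] to guard against floating-point rounding.
            cos_sim = max(-1.0, min(1.0, cos_sim))
            value = 1.0 - math.acos(cos_sim) / math.pi
            # Fill both triangles so the result is symmetric by construction.
            affinity[i][j] = affinity[j][i] = value
    return affinity


# Hypothetical connectivity profiles for three regions (illustration only).
toy = [[0.0, 0.8, 0.1],
       [0.8, 0.0, 0.1],
       [0.1, 0.1, 0.0]]
aff = normalized_angle_affinity(toy)
```

      Identical connectivity profiles give an affinity of 1 and opposite profiles give 0, so the matrix is bounded and symmetric, which is what the embedding step requires.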

      Figure 1.a: “Affine A” likely refers to the affinity matrix. The term “affine” may be confusing; consider using a clearer label. It would also help to add descriptive labels for rows and columns (e.g. region x region).

      Thank you for this suggestion! We have replaced each of the labels with “pairwise similarity”. We also labelled the rows and columns as regions.

      Figure 1.d: Are the cross-group differences statistically significant? If so, please indicate this in the figure.

      We have added the results of a series of linear mixed effects models to the legend of Figure 1 (Page 6, line 252):

      “indicates a significant effect of dataset (p < 0.05) on variance explained within a linear mixed effects model controlling for head motion, sex, and age at scan”.

      The sentence “whose connectomes were successfully thresholded” in the methods is unclear. What does “successfully thresholded” mean? Additionally, this seems to be the first mention of the Schaefer 100 and Brainnetome atlas; clarify where these parcellations are used. 

      We have amended the Methodology section (Page 23, Line 1138):

      “For each participant, we retained the strongest 10% of connections per row, thus creating fully connected networks required for building affinity matrices. We excluded any connectomes in which such thresholding was not possible due to insufficient non-zero row values. To further ensure accuracy in connectome reconstruction, we excluded any participants whose connectomes failed thresholding in two alternative parcellations: the 100-node Schaefer 7-network (Schaefer et al., 2018) and Brainnetome 246-node (Fan et al., 2016) parcellations, respectively”.
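
      The row-wise thresholding rule, and what it means for a connectome to fail it, can be sketched as follows. This is an illustrative reading of the description above rather than the pipeline itself: the function name, the handling of ties (tied values at the cut-off are all kept), and the toy matrices are assumptions.

```python
def threshold_rows(matrix, keep_fraction=0.10):
    """Keep the strongest `keep_fraction` of entries in each row, zero the rest.

    Returns None when any row has too few non-zero values to fill its quota,
    i.e. the connectome cannot be thresholded.
    """
    n_keep = max(1, round(keep_fraction * len(matrix[0])))
    thresholded = []
    for row in matrix:
        nonzero = [v for v in row if v != 0]
        if len(nonzero) < n_keep:
            return None
        # Smallest value that still belongs to the n_keep strongest entries.
        cutoff = sorted(nonzero, reverse=True)[n_keep - 1]
        thresholded.append([v if (v != 0 and v >= cutoff) else 0.0 for v in row])
    return thresholded


# Hypothetical 4-region connectomes; 50% is used so the toy case keeps 2 per row.
dense = [[0, 5, 3, 1],
         [5, 0, 2, 4],
         [3, 2, 0, 6],
         [1, 4, 6, 0]]
sparse = [[0, 5, 0, 0],
          [5, 0, 0, 0],
          [0, 0, 0, 6],
          [0, 0, 6, 0]]

kept = threshold_rows(dense, keep_fraction=0.5)
failed = threshold_rows(sparse, keep_fraction=0.5)
```

      Here `dense` thresholds cleanly, whereas `sparse` returns `None` because each row has only one non-zero connection, which corresponds to the exclusion criterion described in the quoted passage.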

      We have also specified the use of the Schaefer 200-node parcellation in the first sentence of the second Results paragraph.

      The use of “streamline counts” is misleading, as the method uses SIFT2-weighted fibre bundle capacity rather than raw streamline counts. It would be better to refer to this measure as “SIFT2-weighted fibre bundle capacity” or “FBC”.

      We replaced all instances of “streamline counts” with “SIFT2-weighted fibre bundle capacity” as appropriate.

      Figure 2.c: Consider adding plots showing changes in eccentricity against (1) degree centrality, and (2) weighted local clustering coefficient. Additionally, a plot showing the relationship between age and mean eccentricity (averaged across nodes) at the individual level would be informative.

      We added the correlation between eccentricity and both degree centrality and the weighted local clustering coefficient and included them in our dominance analysis in Figure 2. In terms of the relationship between age and mean (global) eccentricity, these are plotted in Figure 3. 

      Figure 2.b: Considering the results of the following sections, it would be interesting to include additional KDE/violin plots to show group differences in the distribution of eccentricity within 7 different functional networks.

      As part of our analysis to parse neurotypicality and dataset effects, we tested for group differences in the distribution of structural and functional manifold eccentricity within each of the 7 functional networks in the referred and control portions of CALM and have included instances of significant differences with a coloured arrow to represent the direction of the difference within Figure 3. 

      Figure 3: Several panels lack axis labels for x and y axes. Adding these would improve clarity.

      To minimise the amount of text in Figure 3, we opted to include labels only for the global-level structural and functional results. However, to aid interpretation, we added a small schematic at the bottom of Figure 3 to represent all axis labels. 

      The statement that “differences between datasets only emerged when taking development into account” seems inaccurate. Differences in eccentricity are evident across datasets even before accounting for development (see Fig 2.b and the significance in the Scheirer-Ray-Hare test).

      We agree – differences in eccentricity across development and datasets are evident in structural and functional manifold eccentricity, as well as within structure-function coupling. However, effects of neurotypicality were particularly strong for the maturation of structure-function coupling, rather than magnitude. Therefore, we have rephrased this sentence in the Discussion (page 18, line 832):

      “Furthermore, group-level structural and functional gradients were highly consistent across datasets, whilst differences between datasets were emphasised when taking development into account, through differing rates of structural and functional manifold expansion, respectively, alongside maturation of structure-function coupling”.

      The handling of longitudinal data by adding a random effect for individuals is not clear in the main text. Mentioning this earlier could be helpful. 

      We have included this detail in the second sentence of the “developmental trajectories of structural manifold contraction and functional manifold expansion” results sub-section (page 11, line 503):

      “We included a random effect for each participant to account for longitudinal data”. 

      Figure 4.b: Why were ranks shown instead of actual coefficient of variation values? Consider including a cortical map visualization of the coefficients in the supplementary material.

      We visualised the ranks, instead of the actual coefficient of variation (CV) values, due to considerable variability and skew in the magnitude of the CV, ranging from 28.54 (in the right visual network) to 12865.68 (in the parietal portion of the left default-mode network), with a mean of 306.15. If we had visualised the raw CV values, these larger values would’ve been over-represented. We’ve also noticed and rectified an error in the labelling of the colour bar for Figure 4b: the minimum should be most variable (i.e. a rank of 1). To aid contextualisation of the ranks, we have added the following to the Results (page 14, line 626):

      “The distribution of cortical coefficients of variation (CV) varied considerably, with the largest CV (in the parietal division of the left default-mode network) being over 400 times that of the smallest (in the right visual network). The distribution of absolute CVs was positively skewed, with a Fisher skewness coefficient g<sub>1</sub> of 7.172, meaning relatively few regions had particularly high inter-individual variability, and highly peaked, with a kurtosis of 54.883, where a normal distribution has a skewness coefficient of 0 and a kurtosis of 3”. 
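
      The shape statistics and the ranking convention can be made concrete with a minimal sketch. It assumes moment-based (uncorrected) estimators for skewness and kurtosis, since the exact estimator is not stated, and it uses hypothetical CV values. Only the two extreme CVs in the ratio check are taken from the response above.

```python
def skewness_and_kurtosis(values):
    """Moment-based Fisher-Pearson skewness g1 and (non-excess) kurtosis."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    # A normal distribution gives g1 = 0 and kurtosis = 3 under this definition.
    return m3 / m2 ** 1.5, m4 / m2 ** 2


def variability_ranks(values):
    """Rank regions so that rank 1 is the most variable (largest CV)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


# Hypothetical regional CVs with one extreme value (illustration only).
cv = [28.5, 40.0, 55.0, 60.0, 900.0]
g1, kurt = skewness_and_kurtosis(cv)
ranks = variability_ranks(cv)

# Ratio of the largest to the smallest CV reported in the response.
reported_ratio = 12865.68 / 28.54
```

      A single extreme region is enough to produce a strongly positive skew, which is why ranks give a more readable cortical map than raw CV values.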

      Reviewer #2 (Public review):

      Some differences in developmental trajectories between CALM and NKI (e.g. Figure 4d) are not explained. Are these differences expected, or do they suggest underlying factors that require further investigation?

      This is a great point, and we appreciate the push to give a fuller explanation. It is very hard to know whether these effects are expected or not. We certainly don’t know of any other papers that have taken this approach. In response to the reviewer’s point, we decided to run some more analyses to better understand the differences. Having observed stronger age effects on structure-function coupling within the neurotypical NKI dataset, compared to the absent effects in the neurodivergent portion of CALM, we wanted to follow up and test whether coupling really is sensitive to the neurodivergent versus neurotypical difference between CALM and NKI (rather than, say, scanner or site effects). In short, we find stronger developmental effects on coupling within the neurotypical portion of CALM than within the neurodivergent portion, and have added this to the Results (page 15, line 701):

      “To further examine whether a closer correspondence of structure-function coupling with age is associated with neurotypicality, we conducted a follow-up analysis using the additional age-matched neurotypical portion of CALM (N = 77). Given the widespread developmental effects on coupling within the neurotypical NKI sample, compared to the absent effects in the neurodivergent portion of CALM, we would expect strong relationships between age and structure-function coupling within the neurotypical portion of CALM. This is indeed what we found: structure-function coupling showed a linear negative relationship with age globally (F = 16.76, p<sub>FDR</sub> < 0.001, adjusted R<sup>2</sup> = 26.44%), alongside fronto-parietal (F = 9.24, p<sub>FDR</sub> = 0.004, adjusted R<sup>2</sup> = 19.24%), dorsal-attention (F = 13.162, p<sub>FDR</sub> = 0.001, adjusted R<sup>2</sup> = 18.14%), ventral attention (F = 11.47, p<sub>FDR</sub> = 0.002, adjusted R<sup>2</sup> = 22.78%), somato-motor (F = 17.37, p<sub>FDR</sub> < 0.001, adjusted R<sup>2</sup> = 21.92%) and visual (F = 11.79, p<sub>FDR</sub> = 0.002, adjusted R<sup>2</sup> = 20.81%) networks. Together, this supports our hypothesis that within neurotypical children and adolescents, structure-function coupling decreases with age, showing a stronger effect compared to their neurodivergent counterparts, in tandem with the emergence of higher-order cognition. Thus, whilst the magnitude of structure-function coupling across development appeared insensitive to neurotypicality, its maturation is sensitive. Tentatively, this suggests that neurotypicality is linked to stronger and more consistent maturational development of structure-function coupling, whereby the tethering of functional connectivity to structure across development is adaptive”.

      In conjunction with the Reviewer’s later request to deepen the Discussion, we have included an additional paragraph attempting to explain the differences in neurodevelopmental trajectories of structure-function coupling (Page 19, Line 924):

      “Whilst the spatial patterning of structure-function coupling across the cortex has been extensively documented, as explained above, less is known about developmental trajectories of structure-function coupling, or how such trajectories may be altered in those with neurodevelopmental conditions. To our knowledge, only one prior study has examined differences in developmental trajectories of (non-manifold) structure-function coupling in typically-developing children and those with attention-deficit hyperactivity disorder (Soman et al., 2023), one of the most common conditions in the neurodivergent portion of CALM. Namely, using cross-sectional and longitudinal data from children aged between 9 and 14 years old, they demonstrated increased coupling across development in higher-order regions overlapping with the default-mode, salience, and dorsal attention networks, in children with ADHD, with no significant developmental change in controls, thus encompassing an ectopic developmental trajectory (Di Martino et al., 2014; Soman et al., 2023). Whilst the current work does not focus on any one condition, but rather on the broad mixed population of young people with neurodevelopmental symptoms (including those with and without diagnoses), there are meaningful individual and developmental differences in structure-function coupling. Crucially, it is not the case that simply having stronger coupling is desirable. The current work reveals that there are important developmental trajectories in structure-function coupling, suggesting that it undergoes considerable refinement with age. Note that whilst the magnitude of structure-function coupling across development did not differ significantly as a function of neurodivergence, its relationship to age did. Our working hypothesis is that structural connections allow for the ordered integration of functional areas, and the gradual functional modularisation of the developing brain. For instance, those with higher cognitive ability show a stronger refinement of structure-function coupling across development. Future work in this space needs to better understand not just how structural or functional organisation change with time, but rather how one supports the other”.

      The use of COMBAT may have excluded extreme participants from both datasets, which could explain the lack of correlations found with psychopathology.

      COMBAT does not exclude participants from datasets but simply adjusts connectivity estimates. So, the use of COMBAT will not be impacting the links with psychopathology by removing participants. But this did get us thinking. Excluding participants based on high motion may have systematically removed those with high psychopathology scores, meaning incomplete coverage. In other words, we may be under-representing those at the more extreme end of the range, simply because their head-motion levels are higher and thus are more likely to be excluded. We found that despite certain high-motion participants being removed, we still had good coverage of those with high scores and were therefore sensitive within this range. We have added the following to the revised Methods section (Page 26, Line 1338):

      “As we removed participants with high motion, we may have disproportionately removed those with higher psychopathology scores, resulting in incomplete coverage. To examine coverage and sensitivity to broad-range psychopathology following quality control, we calculated the Fisher-Pearson skewness statistic g<sub>1</sub> for each of the 6 Conners t-statistic measures and the proportion of youth with a t-statistic equal to or greater than 65, indicating an elevated or very elevated score. Measures of inattention (g<sub>1</sub> = 0.11, 44.20% elevated), hyperactivity/impulsivity (g<sub>1</sub> = 0.48, 36.41% elevated), learning problems (g<sub>1</sub> = 0.45, 37.36% elevated), executive functioning (g<sub>1</sub> = 0.27, 38.16% elevated), aggression (g<sub>1</sub> = 1.65, 15.58% elevated), and peer relations (g<sub>1</sub> = 0.49, 38% elevated) were positively skewed and comprised at least 15% of children with elevated or very elevated scores, suggesting sufficient coverage of those with extreme scores”.

      There is no discussion of whether the stable patterns of brain organization could result from preprocessing choices or summarizing data to the mean. This should be addressed to rule out methodological artifacts. 

      This is a brilliant point. We are necessarily using a very lengthy pipeline, with many design choices to explore structural and functional gradients and their intersection. In conjunction with the Reviewer’s later suggestion to deepen the Discussion, we have added the following paragraph which details the sensitivity analyses we carried out to confirm the observed stable patterns of brain organization (Page 18, Line 863):

      “That is, whilst we observed developmental refinement of gradients, in terms of manifold eccentricity, standard deviation, and variance explained, we did not observe replacement. Note, as opposed to calculating gradients based on group data, such as a sliding window approach, which may artificially smooth developmental trends and summarise them to the mean, we used participant-level data throughout. Given the growing application of gradient-based analyses in modelling structural (He et al., 2025; Li et al., 2024) and functional (Dong et al., 2021; Xia et al., 2022) brain development, we hope to provide a blueprint of factors which may affect developmental conclusions drawn from gradient-based frameworks”.

      Although imputing missing data was necessary, it would be useful to compare results without imputed data to assess the impact of imputation on findings. 

      It is very hard to know the impact of imputation without simply removing those participants with some imputed data. Using a simulation experiment, we expressed the imputation accuracy as the root mean squared error normalized by the range of observable data in each scale. This produced a percentage error margin. We demonstrate that imputation accuracy across all measures is at worst within approximately 11% of the observed data, and at best within approximately 4% of the observed data, and have included the following in the revised Methods section (Page 27, Line 1348):

      “Missing data

      To avoid a loss of statistical power, we imputed missing data. 27.50% of the sample had one or more missing psychopathology or cognitive measures (equal to 7% of all values), and the data was not missing at random: using a Welch’s t-test, we observed a significant effect of missingness on age [t (264.479) = 3.029, p = 0.003, Cohen’s d = 0.296], whereby children with missing data (M = 12.055 years, SD = 3.272) were younger than those with complete data (M = 12.902 years, SD = 2.685). Using a subset with complete data (N = 456), we randomly sampled 10% of the values in each column with replacement and assigned those as missing, thereby mimicking the proportion of missingness in the entire dataset. We conducted KNN imputation (uniform weights) on the subset with complete data and calculated the imputation accuracy as the root mean squared error normalized by the observed range of each measure. Thus, each measure was assigned a percentage which described the imputation margin of error. Across cognitive measures, imputation was within a 5.40% mean margin of error, with the lowest imputation error in the Trail motor speed task (4.43%) and highest in the Trails number-letter switching task (7.19%). Across psychopathology measures, imputation exhibited a mean 7.81% error margin, with the lowest imputation error in the Conners executive function scale (5.75%) and the highest in the Conners peer relations scale (11.04%). Together, this suggests that imputation was accurate”.
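
      The simulation logic described above (mask known values, impute them, and express the error relative to each measure's observed range) can be sketched as follows. This is a toy illustration under stated assumptions, not the analysis code: the number of neighbours `k`, the distance measure, the fallback to the column mean, and the simulated data are all hypothetical choices.

```python
import math
import random


def knn_impute(rows, k=5):
    """Fill None entries with the unweighted mean of the k nearest rows."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_i, other in enumerate(rows):
                if other_i == i or other[j] is None:
                    continue
                shared = [(a - b) ** 2 for a, b in zip(row, other)
                          if a is not None and b is not None]
                if shared:
                    # Mean squared difference over jointly observed columns.
                    candidates.append((sum(shared) / len(shared), other[j]))
            candidates.sort(key=lambda c: c[0])
            donors = [v for _, v in candidates[:k]]
            if not donors:
                # Fallback (assumption): use the observed column mean.
                donors = [r[j] for r in rows if r[j] is not None]
            filled[i][j] = sum(donors) / len(donors)  # uniform weights
    return filled


def nrmse_by_column(complete, masked, imputed):
    """RMSE over masked cells divided by the observed range of each column."""
    errors = []
    for j in range(len(complete[0])):
        column = [r[j] for r in complete]
        sq = [(imputed[i][j] - complete[i][j]) ** 2
              for i in range(len(complete)) if masked[i][j] is None]
        rmse = math.sqrt(sum(sq) / len(sq))
        errors.append(rmse / (max(column) - min(column)))
    return errors


random.seed(0)
# Hypothetical complete-case data: 60 participants, 3 correlated measures.
complete = []
for _ in range(60):
    latent = random.gauss(0, 1)
    complete.append([latent + random.gauss(0, 0.3) for _ in range(3)])

masked = [list(r) for r in complete]
for j in range(3):
    # Sample 10% of row indices with replacement, as described in the text.
    for i in random.choices(range(len(complete)), k=round(0.10 * len(complete))):
        masked[i][j] = None

imputed = knn_impute(masked, k=5)
error_margin = [100 * e for e in nrmse_by_column(complete, masked, imputed)]
```

      Each entry of `error_margin` is a percentage of that measure's observed range, which is the sense in which the quoted passage reports imputation error margins.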

      The results section is extensive, with many reports, while the discussion is relatively short and lacks in-depth analysis of the findings. Moving some results into the discussion could help balance the sections and provide a deeper interpretation.

      We agree with the Reviewer and appreciate the nudge to expand the Discussion section. We have added 4 sections to the Discussion. The first explores the importance of the default-mode network as a region whose coupling is most consistently predicted by working memory across development and phenotypes, in terms of its underlying anatomy (Paquola et al., 2025) (Page 20, Line 977):

      “An emerging theme from our work is the importance of the default-mode network as a region in which structure-function coupling is reliably predicted by working memory across neurodevelopmental phenotypes and datasets during childhood and adolescence. Recent neurotypical adult investigations combining high-resolution post-mortem histology, in vivo neuroimaging, and graph-theory analyses have revealed how the underlying neuroanatomy of the default-mode network may support diverse functions (Paquola et al., 2025), and thus exhibit lower structure-function coupling compared to unimodal regions. The default-mode network has distinct neuroanatomy compared to the remaining 6 intrinsic resting-state functional networks (Yeo et al., 2011), containing a distinctive combination of 5 of the 6 von Economo and Koskinas cell types (von Economo & Koskinas, 1925), with an over-representation of heteromodal cortex, and uniquely balancing output across all cortical types. A primary cytoarchitectural axis emerges, beyond which are mosaic-like spatial topographies. The duality of the default-mode network, in terms of its ability to both integrate and be insulated from sensory information, is facilitated by two microarchitecturally distinct subunits anchored at either end of the cytoarchitectural axis (Paquola et al., 2025). Whilst beyond the scope of the current work, structure-function coupling and their predictive value for cognition may also differ across divisions within the default-mode network, particularly given variability in the smoothness and compressibility of cytoarchitectural landscapes across subregions (Paquola et al., 2025)”.

      The second provides a deeper interpretation and contextualisation of the greater sensitivity of communicability, rather than functional connectivity, to neurodivergence (Page 19, Line 907):

      “We consider two possible factors to explain the greater sensitivity of neurodivergence to gradients of communicability, rather than functional connectivity. First, functional connectivity is likely more sensitive to head motion than structural-based communicability and suffers from reduced statistical power due to stricter head motion thresholds, alongside greater inter-individual variability. Second, whilst prior work contrasting functional connectivity gradients from neurotypical adults with those with confirmed ASD diagnoses demonstrated vertex-level reductions in the default-mode network in ASD and marginal increases in sensory-motor communities (Hong et al., 2019), indicating a sensitivity of functional connectivity to neurodivergence, important differences remain. Specifically, whilst the vertex-level group-level differences were modest, in line with our work, greater differences emerged when considering step-wise functional connectivity (SFC); in other words, when considering the dynamic transitions of or information flow through the functional hierarchy underlying the static functional connectomes, such that ASD was characterised by initial faster SFC within the unimodal cortices followed by a lack of convergence within the default-mode network (Hong et al., 2019). This emphasis on information flow and dynamic underlying states may point towards greater sensitivity of neurodivergence to structural communicability – a measure directly capturing information flow – than static functional connectivity”.

      The third paragraph situates our work within a broader landscape of reliable brain-behaviour relationships, focusing on the strengths of combining clinical and normative samples to refine our interpretation of the relationship between gradients and cognition, as well as the importance of equifinality in developmental predictive work (Page 20, line 994):

      “In an effort to establish more reliable brain-behaviour relationships despite not having the statistical power afforded by large-scale, typically normative, consortia (Rosenberg & Finn, 2022), we demonstrated the development-dependent link between default-mode structure-function coupling and working memory generalised across clinical (CALM) and normative (NKI) samples, across varying MRI acquisition parameters, and harnessing within- and across-participant variation. Such multivariate associations are likely more reliable than their univariate counterparts (Marek et al., 2022), but can be further optimised using task-related fMRI (Rosenberg & Finn, 2022). The consistency, or lack thereof, of developmental effects across datasets emphasises the importance of validating brain-behaviour relationships in highly diverse samples. Particularly evident in the case of structure-function coupling development, through our use of contrasting samples, is equifinality (Cicchetti & Rogosch, 1996), a key concept in developmental neuroscience: namely, similar ‘endpoints’ of structure-function coupling may be achieved through different initialisations dependent on working memory”.

      The fourth paragraph details methodological limitations in response to Reviewer 1’s suggestions to justify the exclusion of subcortical regions and consider the role of spatial smoothing in structural connectome construction as well as the threshold for filtering short streamlines.

      While the methods are thorough, it is not always clear whether the optimal approaches were chosen for each step, considering the available data. 

      In response to Reviewer 1’s concerns, we conducted several sensitivity analyses to evaluate the robustness of our results in terms of procedure. Specifically, we evaluated the impact of thresholding (full or sparse), level of analysis (individual or group gradients), construction of the structural connectome (communicability or fibre bundle capacity), Procrustes rotation (alignment to group-level gradients before Procrustes), tracking the variance explained in individual connectomes by group-level gradients, impact of head motion, and distinguishing between site and neurotypicality effects. All these analyses converged on the same conclusion: whilst we observe some developmental refinement in gradients, we do not observe replacement. We refer the reviewer to their third point, about whether stable patterns of brain organization were artefactual. 

      The introduction is overly long and includes numerous examples that can distract readers unfamiliar with the topic from the main research questions. 

      We have removed the following from the Introduction, reducing it to just under 900 words:

      “At a molecular level, early developmental patterning of the cortex arises through interacting gradients of morphogens and transcription factors (see Cadwell et al., 2019). The resultant areal and progenitor specialisation produces a diverse pool of neurones, glia, and astrocytes (Hawrylycz et al., 2015). Across childhood, an initial burst in neuronal proliferation is met with later protracted synaptic pruning (Bethlehem et al., 2022), the dynamics of which are governed by an interplay between experience-dependent synaptic plasticity and genomic control (Gottlieb, 2007)”.

      “The trends described above reflect group-level developmental trends, but how do we capture these broad anatomical and functional organisational principles at the level of an individual?”

      We’ve also trimmed the second Introduction paragraph so that it includes fewer examples, such as removal of the wiring-cost optimisation that underlies structural brain development, as well as removing specific instances of network segregation and integration that occur throughout childhood.

    1. eLife Assessment

      In this valuable technical report, Verma et al. provide convincing evidence that endogenously tagged dynein and dynactin form processive motor complexes that move along microtubules in living cells. Using quantitative fluorescence microscopy, they directly compare the stoichiometry and motility of these complexes to kinesin-1, revealing distinct transport behaviors and regulatory properties. This study offers a key methodological and conceptual advance for understanding the dynamics of native motor proteins within the cellular environment and will be of interest to the cell biology community.

    2. Reviewer #1 (Public Review):

      The manuscript by Verma et al. is a simple and concise assessment of the in-cell motility parameters of cytoplasmic dynein. Although numerous studies have focused on understanding the mechanism by which dynein is activated using a complement of in vitro methodologies, an assessment of dynein motility in cells has been lacking. It has been unclear whether dynein exhibits high processivity within the crowded and complicated environment of the cell. For example, does cargo-bound dynein exhibit short, non-processive motility (as has been recently suggested; Tirumala et al., 2022 bioRxiv)? Does cargo-bound dynein move against opposing forces generated by cargo-bound kinesins? Do cargoes exhibit bidirectional switching due to stochastic activation of kinesins and dyneins? The current work addresses these questions quite simply by observing and quantitating the motility of natively tagged dynein in HeLa cells.

    3. Reviewer #2 (Public Review):

      Verma et al. provide a short technical report showing that endogenously tagged dynein and dynactin molecules localize to growing microtubule plus-ends and also move processively along microtubules in cells. The data are convincing, and the imaging and movies very nicely demonstrate their claims. I don't have any large technical concerns about the work. It is perhaps not surprising that dynein-dynactin complexes behave this way in cells due to other reports on the topic, but the current data are among some of the nicest direct demonstrations of this phenomenon. It may be somewhat controversial since a separate group has reported that dynein does not move processively in mammalian cells (https://www.biorxiv.org/content/10.1101/2021.04.05.438428v3).

    4. Reviewer #3 (Public Review):

      In this manuscript, Verma et al. set out to visualize cytoplasmic dynein in living cells and describe their behaviour. They first generated heterozygous CRISPR-Cas9 knock-ins of DHC1 and p50 subunit of dynactin and used spinning disk confocal microscopy and TIRF microscopy to visualize these EGFP-tagged molecules. They describe robust localization and movement of DHC and p50 at the plus tips of MTs, which was abrogated using SiR tubulin to visualize the pool of DHC and p50 on the MTs. These DHC and p50 punctae on the MTs showed similar, highly processive movement on MTs. Based on comparison to inducible EGFP-tagged kinesin-1 intensity in Drosophila S2 cells, the authors concluded that the DHC and p50 punctae visualized represented 1 DHC-EGFP dimer+1 untagged DHC dimer and 1 p50-EGFP+3 untagged p50 molecules.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review): 

      Strengths: 

      The work uses a simple and straightforward approach to address the question at hand: is dynein a processive motor in cells? Using a combination of TIRF and spinning disc confocal microscopy, the authors provide a clear and unambiguous answer to this question. 

      Thank you for recognizing the strength of our work.

      Weaknesses: 

      My only significant concern (which is quite minor) is that the authors focus their analysis on dynein movement in cells treated with docetaxel, which could potentially affect the observed behavior. However, this is likely necessary, as without it, motility would not have been observed due to the 'messiness' of dynein localization in a typical cell (e.g., plus end-tracking in addition to cargo transport).

      You are exactly correct that this treatment was required to provide us with a clear view of motile dynein and p50 puncta. One concern about the treatment that we had noted in our original submission was that the docetaxel derivative SiR-Tubulin could increase microtubule detyrosination, which has been implicated in affecting the initiation of dynein-dynactin motility but not motility rates (doi: 10.15252/embj.201593071). In response to a comment from reviewer 2, we investigated whether there was a significant increase in alpha-tubulin detyrosination in our treatment conditions and found that there was not. We have removed the discussion of this possibility from the revised version. Please also see the response to comments raised by reviewer 2.

      Reviewer 1 (Recommendations for the authors):

      Major points: 

      (1) The authors measured kinesin-1-GFP intensities in a different cell line (Drosophila S2 cells) than what was used for the DHC and p50 measurements (HeLa cells). It is unclear if this provides a fair comparison given the cells provide different environments for the GFP. Although the differences may in fact be trivial, without somehow showing this is indeed a fair comparison, it should at least be noted as a caveat when interpreting relative intensity differences. Alternatively, the authors could compare DHC and p50 intensities to those measured from HeLa cells treated with taxol.

      Thank you for this suggestion. We conducted new rounds of imaging with the DHC-EGFP and p50-EGFP clones in conjunction with HeLa cells transiently expressing the human kinesin-1-EGFP and now present the datasets from the new experiments. Importantly, our new data were entirely consistent with the prior analyses: there was no significant difference between the kinesin-1-EGFP dimer intensities and the DHC-EGFP puncta intensities, and there was a statistically significant difference in the intensity of p50 puncta, which were approximately half the intensity of the kinesin-1 and DHC. We have moved the old data comparing the intensities in S2 cells expressing kinesin-1-EGFP to Figure 3 - figure supplement 2 A-D, and the new HeLa cell data are now shown in Figure 3 D-G.

      (2) Given the low number of observations (41-100 puncta), I think a scatter plot showing all data points would offer readers a more transparent means of viewing the single-molecule data presented in Figures 3A, B, C, and G. I also didn't see 'n' values for plots shown in Figure 3. 

      The box and whisker plots have now been replaced with scatter plots showing all data points. The accompanying ‘n’ values have been included in the figure 3 legend as well as the histograms in figures 1 and 2 that are represented in the comparative scatter plots.  

      (3) Given the authors have produced a body of work that challenges conclusions from another pre-print (Tirumala et al., 2022 bioRxiv) - specifically, that dynein is not processive in cells - I think it would be useful to include a short discussion about how their work challenges theirs. For example, one significant difference between the two experimental systems that may account for the different observations could simply be that the authors of the Tirumala study used a mouse DHC (in HeLa cells), which may not have the ability to assemble into active and processive dynein-dynactin-adaptor complexes. 

      Thank you for pointing this out! At the time we submitted our manuscript we were conflicted about citing a pre-print that had not been peer reviewed simply to point out the discrepancy. If we had done so at that time we would have proposed the exact potential technical issue that you have proposed here. However, at the time we felt it would be better for these issues to be addressed through the review process. Needless to say, we agree with your interpretation and now that the work is published (Tirumala et al. JCB, 2024) it is entirely appropriate to add a discussion on Tirumala et al. where contradictory observations were reported. 

      The following statement has been added to the manuscript: 

      “In contrast, a separate study (Tirumala et al., 2024) reported that dynein is not highly processive, typically exhibiting runs of very short duration (~0.6 s) in HeLa cells. A notable technical difference that may account for this discrepancy is that our study visualizes endogenously tagged human DHC, whereas Tirumala et al. characterized over-expressed mouse DHC in HeLa cells. Over-expression of the DHC may result in an imbalance of the subunits that comprise the active motor complex, leading to inactive, or less active complexes. Similarly, mouse DHC may not have the ability to efficiently assemble into active and processive dynein-dynactin-adaptor complexes to the same extent as human DHC.”

      Minor points: 

      (1) "Specifically, the adaptor BICD2 recruited a single dynein to dynactin while BICDR1 and HOOK3 supported assembly of a "double dynein" complex." It would be more accurate to say that dynein-dynactin complexes assembled with Bicd2 "tend to favor single dynein, and the Bicdr1 and Hook3 tend to favor two dyneins" since even Bicd2 can support assembly of 2 dynein-1 dynactin complexes (see Urnavicius et al, Nature 2018). 

      Thank you, the manuscript has been edited to reflect this point. 

      (2) "Human HeLa cells were engineered using CRISPR/Cas9 to insert a cassette encoding FKBP and EGFP tags in the frame at the 3' end of the dynein heavy chain (DYNC1H1) gene (SF1)." It is unclear to what "SF1" is referring. 

      SF1 is supplementary figure 1, which we have now clarified as being Figure 1 – figure supplement 1A.

      (3) "The SiR-Tubulin-treated cells were subjected to two-color TIRFM to determine if the DHC puncta exhibited motility and; indeed, puncta were observed streaming along MTs..." This sentence is strangely punctuated (the ";" is likely a typo?). 

      Thank you for pointing this out, the typo has been corrected and the sentence now reads:

      “The SiR-Tubulin-treated cells were subjected to two-color TIRFM and DHC-EGFP puncta were clearly observed streaming on SiR-Tubulin-labeled MTs, which was especially evident on MTs that were pinned between the nucleus and the plasma membrane (Video 3).”

      (4) I am unfamiliar with the "MK" acronym shown above the molecular weight ladders in Figure 3H and I. Did the authors mean to use "MW" for molecular weight? 

      We intended this to mean MW and the typo has been corrected.

      (5) "This suggests that the cargos, which we presume motile dynein-dynactin puncta are bound to, any kinesins..." This sentence is confusing as written. Did the authors mean "and kinesins"? 

      Agreed. We have changed this sentence to now read: 

      “The velocity and low switching frequency of motile puncta suggest that any kinesin motors associated with cargos being transported by the dynein-dynactin visualized here are inactive and/or cannot effectively bind the MT lattice during dynein-dynactin-mediated transport in interphase HeLa cells.”

      Reviewer 2 (Recommendations for the authors):

      (1) I am confused as to why the authors introduced an FKBP tag to the DHC and no explanation is given. Is it possible this tag induces artificial dimerization of the DHC? 

      FKBP was tagged to DHC for potential knock sideways experiments. Since the current cell line does not express the FKBP counterpart FRB, having FKBP alone in the cell line would not lead to artificial dimerization of DHC.

      (2) The authors use a high concentration of SiR-tubulin (1uM) before washing it out. However, they observe strong effects on MT dynamics. The manufacturer states that concentrations below 100nM don't affect MT dynamics, so I am wondering why the authors are using such a high amount that leads to cellular phenotypes. 

      We would like to note that in our hands even 100 nM SiR-Tubulin impacted MT dynamics if it was incubated for enough time to get a bright signal for imaging, which makes sense since drugs like docetaxel and taxol become enriched in cells over time. Thus, it was a trade-off between the extent/brightness of labeling and the effects on MT dynamics. We opted for a shorter incubation with a higher concentration of SiR-Tubulin to achieve rapid MT labeling and efficient suppression of plus-end MT polymerization. This approach proved useful for our needs since the loss of the tip-tracking pool of DHC provided a clearer view of the motile population of MT-associated DHC.

      (3) The individual channels should be labeled in the supplemental movies. 

      They have now been labelled.

      (4) I would like to see example images and kymographs of the GFP-Kinesin-1 control used for fluorescent intensity analysis. Further, the authors use the mean of the intensity distribution, but I wonder why they don't fit the distribution to a Gaussian instead, as that seems more common in the field to me. Do the data fit well to a Gaussian distribution? 

      Example images and kymographs of the kinesin-1-EGFP control HeLa cells used for the updated fluorescent intensity analysis have now been added to the manuscript in Figure 3 - figure supplement 1. The kinesin-1-EGFP transiently expressed in HeLa cells exhibited a slower mean velocity and run length than the endogenously tagged HeLa dynein-dynactin. Regarding the distribution, we applied 6 normality tests to the new datasets acquired with DHC and p50 in comparison to human kinesin-EGFP in HeLa cells. While we are confident concluding that the data for p50 were normally distributed (p > 0.05 in 6/6), it was more difficult to reach conclusions about the normality of the datasets for kinesin-1 (p > 0.05 in 4/6) and DHC (p > 0.05 in 1/6). We have decided to report the data as scatter plots (per the suggestion in major point 2 by reviewer 1) in the new Figure 3G since it could be misleading to fit a non-normal distribution with a single Gaussian. We note that the likely non-normal distribution of the DHC data (since it “passed” only 1/6 normality tests) could reflect the presence of other populations (e.g. 1 DHC-EGFP in a motile punctum), but we could not confidently conclude this either, since attempting to fit the data with a double Gaussian did not pass statistical muster. Indeed, as stated in the text on lines 197-198, we do not exclude that the range of DHC intensities measured here may include sub-populations of complexes containing a single dynein dimer with one DHC-EGFP molecule.

      Ultimately, we feel the safest conclusion is that there was not a statistically significant difference between the DHC and kinesin-1 dimers (p = 0.32) but there was a statistically significant difference between both the DHC and kinesin-1 dimers compared to the p50 (p values < 0.001), which was ~50% the intensity of DHC and kinesin-1. Altogether this leads us to the fairly conservative conclusion that DHC puncta contain at least one dimer while the p50 puncta likely contain a single p50-EGFP molecule.
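
      The response above turns on whether each intensity dataset passes a normality test at p > 0.05. As a minimal sketch of that kind of check, the snippet below implements the Jarque-Bera test using only the Python standard library. This is an illustration under stated assumptions: the response does not name the six tests that were applied, so Jarque-Bera is a stand-in, the function name `jarque_bera` and the sample values are hypothetical, and the asymptotic p-value is only approximate for samples as small as the 41-100 puncta discussed here.

```python
import math


def jarque_bera(values):
    """Return (JB statistic, asymptotic p-value) for a sample.

    JB = n/6 * (S^2 + K^2/4), where S is the sample skewness and K the
    excess kurtosis. Under normality JB is asymptotically chi-square
    with 2 degrees of freedom, whose survival function is exp(-x/2).
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 observations")
    mean = sum(values) / n
    # Central moments (population form, as used by the JB statistic).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("sample has zero variance")
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


if __name__ == "__main__":
    # Hypothetical, made-up values for illustration only (not the study's data).
    symmetric = [-2, -1, -1, 0, 0, 0, 1, 1, 2]
    skewed = [1] * 20 + [100]
    for name, sample in (("symmetric", symmetric), ("skewed", skewed)):
        jb, p = jarque_bera(sample)
        verdict = "consistent with normality" if p > 0.05 else "normality rejected"
        print(f"{name}: JB = {jb:.3f}, p = {p:.3g} ({verdict})")
```

      A dataset that fails most such tests, as the DHC intensities did, is better shown as raw points than summarized by a single fitted Gaussian, which is the choice the authors describe.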

      (5) The authors suggest the microtubules in the cells treated with SiR-tubulin may be more detyrosinated due to the treatment. Why don't they measure this using well-characterized antibodies that distinguish tyrosinated/detyrosinated microtubules in cells treated or not with SiR-tubulin? 

      At your suggestion, we carried out the experiment and found that under our labeling conditions there was not a notable difference in microtubule detyrosination between DMSO- and SiR-Tubulin-treated cells. Thus, we have removed this caveat from the revised manuscript.

      (6) "While we were unable to assess the relative expression levels of tagged versus untagged DHC for technical reasons." Please describe the technical reasons for the inability to measure DHC expression levels for the reader.

      We made several attempts to quantify the relative amounts of untagged and tagged protein by Western blotting. The high molecular weight of DHC (~500kDa) makes it difficult to resolve it on a conventional mini gel. We attempted running a gradient mini gel (4%-15%), and doing a western blot; however, we were still unable to detect DHC. To troubleshoot, the experiments were repeated with different dilutions of a commercially available antibody and varying concentrations of cell lysate; however, we were unable to obtain a satisfactory result. 

      We hold the view that, even if it had worked, it would have been difficult to detect a relatively small difference between the untagged (MW = 500kDa) and tagged DHC (MW = 527kDa) by western blot. We have added language to this effect in the revised manuscript.

      Reviewer #3 (Public Review):

      (1). CRISPR-edited HeLa clones: 

      (i) The authors indicate that both the DHC-EGFP and p50-EGFP lines are heterozygous and that the level of DHC-EGFP was not measured due to technical difficulties. However, quantification of the relative amounts of untagged and tagged DHC needs to be performed - either using Western blot, immunofluorescence or qPCR comparing the parent cell line and the cell lines used in this work. 

      See response to reviewer 2 above. 

      (ii) The localization of DHC predominantly at the plus tips (Fig. 1A) is at odds with other work where endogenous or close-to-endogenous levels of DHC were visualized in HeLa cells and other non-polarized cells like HEK293, A-431 and U-251MG (e.g., OpenCell (https://opencell.czbiohub.org/target/CID001880), Human Protein Atlas, https://www.biorxiv.org/content/10.1101/2021.04.05.438428v3). The authors should perform immunofluorescence of DHC in the parental cells and DHC-EGFP cells to confirm there are no expression artifacts in the latter. Additionally, a comparison of the colocalization of DHC with EB1 in the parental and DHC-EGFP and p50-EGFP lines would be good to confirm MT plus-tip localisation of DHC in both lines.

      The microtubule (MT) plus-tip localization of DHC was already observed in the 1990s, as evidenced by publications such as (PMID:10212138) and (PMID:12119357), and was further confirmed by Kobayashi and Murayama in 2009 (PMID:19915671). We hold the view that further investigation into this localization is not worthwhile since the tip-tracking behavior of DHC-dynactin has long been established in the field.

      (iii) It would also be useful to see entire fields of view of cells expressing DHC-EGFP and p50-EGFP (e.g. in Spinning Disk microscopy) to understand if there is heterogeneity in expression. Similarly, it would be useful to report the relative levels of expression of EGFP (by measuring the total intensity of EGFP fluorescence per cell) in those cells employed for the analysis in the manuscript.

      Representative images of fields have been added as Figure 1 - figure supplement 1B and Figure 2 – figure supplement 1 in the revised manuscript. We did not see drastic cell-to-cell variation of expression within the clonal cell lines.

      (iv) Given that the authors suspect there is differential gene regulation in their CRISPR-edited lines, it cannot be concluded that the DHC-EGFP and p50-EGFP punctae tracked are functional and not piggybacking on untagged proteins. The authors could use the FKBP part of the FKBP-EGFP tag to perform knock-sideways of the DHC and p50 to the plasma membrane and confirm abrogation of dynein activity by visualizing known dynein targets such as the Golgi (Golgi should disperse following recruitment of EGFP-tagged DHC-EGFP or p50-EGFP to the PM), or EGF (movement towards the cell center should cease).

      Despite trying different concentrations and extensive troubleshooting, we were not able to replicate the reported observations of Ciliobrevin D or Dynarrestin during mitosis. We would like to emphasize that the velocity (1.2 μm/s) of dynein-dynactin complexes that we measured in HeLa cells was comparable to those measured in iNeurons by Fellows et al. (PMID: 38407313) and for unopposed dynein under in vitro conditions. 

      (2) TIRFM and analysis: 

      (i) What was the rationale for using TIRFM given its limitation of visualization at/near the plasma membrane? Are the authors confident they are in TIRF mode and not HILO, which would fit with the representative images shown in the manuscript? 

      To avoid overcrowding, it was important to image the MT tracks that were pinned between the nucleus and the plasma membrane. It is unclear to us why the reviewer feels that true TIRFM could not be used to visualize the movement of dynein-dynactin on this population of MTs, since the plasma membrane is ~3-5 nm thick and a MT is ~25-27 nm in diameter, all of which would fall well within the 100-200 nm excitable range of the evanescent wave produced by TIRF. While we feel TIRF can effectively visualize dynein-dynactin motility in cells, we have mentioned the possibility that some imaging may be HILO microscopy in the materials and methods.

      (ii) At what depth are the authors imaging DHC-EGFP and p50-EGFP? 

      The imaging depth of traditional TIRFM is limited to around 100-200 nm. In adherent interphase HeLa cells the nucleus is in very close proximity (nanometer not micron scale) to the plasma membrane with some cytoskeletal filaments (actin) and microtubules positioned between the plasma membrane and the nuclear membrane. The fact that we were often visualizing MTs positioned between the nucleus and the membrane makes us confident that we were imaging at a depth (100 - 200nm) consistent with TIRFM. 
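
      The 100-200 nm figure quoted above follows from the standard expression for the characteristic decay depth of the evanescent intensity, d = λ / (4π · sqrt(n1² sin²θ − n2²)). The sketch below evaluates it. All numerical inputs are assumptions chosen for illustration and are not taken from the manuscript: a 488 nm excitation line, a glass index of 1.518, a cytoplasm index of about 1.37, and a 70° incidence angle; the function name is likewise hypothetical.

```python
import math


def tirf_penetration_depth_nm(wavelength_nm, n_glass, n_sample, incidence_deg):
    """Characteristic (1/e) decay depth of the evanescent intensity in nm.

    Uses d = wavelength / (4 * pi * sqrt(n_glass^2 * sin^2(theta) - n_sample^2)),
    which is only defined above the critical angle for total internal reflection.
    """
    theta = math.radians(incidence_deg)
    term = (n_glass * math.sin(theta)) ** 2 - n_sample ** 2
    if term <= 0:
        raise ValueError("incidence angle is at or below the critical angle")
    return wavelength_nm / (4.0 * math.pi * math.sqrt(term))


if __name__ == "__main__":
    # Assumed optical parameters (illustrative, not from the manuscript).
    wavelength_nm = 488.0
    n_glass, n_sample = 1.518, 1.37
    critical_deg = math.degrees(math.asin(n_sample / n_glass))
    print(f"critical angle: {critical_deg:.1f} degrees")
    for angle in (66.0, 70.0, 75.0):
        depth = tirf_penetration_depth_nm(wavelength_nm, n_glass, n_sample, angle)
        print(f"incidence {angle:.0f} degrees -> depth {depth:.0f} nm")
```

      Under these assumed values the depth grows rapidly as the incidence angle approaches the critical angle, which is one reason the boundary between true TIRF and HILO illumination can be hard to pin down in practice.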

      (iii) The authors rely on manual inspection of tracks before analyzing them in kymographs - this is not rigorous and is prone to bias. They should instead track the molecules using single particle tracking tools (eg. TrackMate/uTrack), and use these traces to then quantify the displacement, velocity, and run-time. 

      Although automated single particle tracking tools offer several benefits, including reduced human effort, and scalability for large datasets, they often rely on specialized training datasets and do not generalize well to every dataset. The authors contend that under complex cellular environments human intervention is often necessary to achieve a reliable dataset. Considering the nature of our data we felt it was necessary to manually process the time-lapses. 

      (iv) It is unclear how the tracks that were eventually used in the quantification were chosen. Are they representative of the kind of movements seen? Kymographs of dynein movement along an entire MT/cell needs to be shown and all punctae that appear on MTs need to be tracked, and their movement quantified. 

      Considering the densely populated environment of a cell, it would be nearly impossible to quantify all the datasets. We selected tracks for quantification, focusing on areas where MTs were pinned between the nucleus and plasma membrane, where we could track the movement of a single dynein molecule and where the surroundings were relatively less crowded.

      (v) What is the directionality of the moving punctae? 

      In our experience, cells rarely organized their MTs in the textbook radial MT array meaning that one could not confidently conclude that “inward” movements were minus-end directed. Microtubule polarity was also not able to be determined for the MTs positioned between the plasma membrane and the nucleus on which many of the puncta we quantified were moving. It was clear that motile puncta moving on the same MT moved in the same direction with the exception of rare and brief directional switching events. What was more common than directional switching on the same MT were motile puncta exhibiting changes in direction at sharp (sometimes perpendicular) angles indicative of MT track switching, which is a well-characterized behavior of dynein-dynactin (See DOI: 10.1529/biophysj.107.120014).

      (vi) Since all the quantification was performed on SiR tubulin-treated cells, it is unclear if the behavior of dynein observed here reflects the behavior of dynein in untreated cells. Analysis of untreated cells is required. 

      It was important to quantify SiR tubulin-treated cells because SiR-Tubulin is a docetaxel derivative, and its addition suppressed plus-end MT polymerization resulting in a significant reduction in the DHC tip-tracking population and a clearer view of the motile population of MT-associated DHC puncta. Otherwise, it was challenging to reliably identify motile puncta given the abundance of DHC tip-tracking populations in untreated cells.  

      (3) Estimation of stoichiometry of DHC and p50 

      Given that the punctae of DHC-EGFP and p50 seemingly bleach on MT before the end of the movie, the authors should use photobleaching to estimate the number of molecules in their punctae, either by simple counting the number of bleaching steps or by measuring single-step sizes and estimating the number of molecules from the intensity of punctae in the first frame. 

      Comparing the fluorescence intensity of a known molecule (in our case a kinesin-1-EGFP dimer) to calculate the numbers of an unknown protein molecule (in our case Dynein or p50) is a widely accepted technique in the field. For example, refer to PMID: 29899040. To accurately estimate the stoichiometry of DHC and p50 and address the concerns raised by other reviewers, we expressed the human kinesin-EGFP in HeLa cells and analyzed the datasets from new experiments. We did not observe any significant differences between our old and new datasets.
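
      The reference-standard approach described above reduces to a ratio: divide the mean punctum intensity by the per-fluorophore intensity inferred from a standard of known copy number. The sketch below shows that arithmetic, assuming background-subtracted intensities acquired under identical imaging conditions and a kinesin-1-EGFP dimer carrying two EGFPs. The function name and all intensity values are hypothetical placeholders, not measurements from the study.

```python
from statistics import mean


def estimate_egfp_copies(puncta_intensities, reference_intensities, reference_copies=2):
    """Estimate the mean number of EGFP molecules per punctum.

    The reference sample is assumed to contain `reference_copies` EGFPs
    per particle (2 for a kinesin-1-EGFP dimer). Intensities are assumed
    to be background-subtracted and measured under the same conditions.
    """
    if not puncta_intensities or not reference_intensities:
        raise ValueError("both intensity lists must be non-empty")
    reference_mean = mean(reference_intensities)
    if reference_mean <= 0:
        raise ValueError("reference intensity must be positive")
    intensity_per_egfp = reference_mean / reference_copies
    return mean(puncta_intensities) / intensity_per_egfp


if __name__ == "__main__":
    # Hypothetical arbitrary-unit intensities for illustration only.
    kinesin_dimer = [980, 1010, 1030, 995, 1005]
    dhc_puncta = [1000, 960, 1040, 1020, 990]
    p50_puncta = [510, 480, 500, 520, 495]
    print(f"DHC: ~{estimate_egfp_copies(dhc_puncta, kinesin_dimer):.1f} EGFP per punctum")
    print(f"p50: ~{estimate_egfp_copies(p50_puncta, kinesin_dimer):.1f} EGFP per punctum")
```

      Because the knock-in lines are heterozygous, an EGFP count of this kind is a lower bound on the number of subunits per punctum: untagged copies contribute no fluorescence.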

      (4) Discussion of prior literature 

      Recent work visualizing the behavior of dyneins in HeLa cells (DOI:  10.1101/2021.04.05.438428), which shows results that do not align with observations in this manuscript, has not been discussed. These contradictory findings need to be discussed, and a more objective assessment of the literature in general needs to be undertaken.

    1. eLife Assessment

      This valuable manuscript presents a potentially novel mechanism by which the phospholipid scramblase, PLSCR1, defends against influenza A virus infection. The strength of the paper rests on solid findings involving knockout and lung specific over-expressing Plscr1 mice, airway tissue expression and mechanistic studies to show Plscr1 enhances type III interferon-mediated viral clearance.

    2. Reviewer #1 (Public review):

      This manuscript by Yang et al. presents a potentially novel mechanism by which Plscr1 defends against influenza virus infection. Using a global knockout (KO) and a tissue-specific overexpression mouse model, the authors demonstrate that Plscr1-KO mice exhibit increased susceptibility and inflammation following IAV infection. In contrast, overexpression of Plscr1 in ciliated epithelial cells protects mice from infection. Through transcriptomic analysis in mice and mechanistic studies in cell culture models, the authors reveal that Plscr1 transcriptionally upregulates Ifnlr1 expression and physically interacts with this receptor on the plasma membrane, thereby enhancing IFN-λ-mediated viral clearance.

      Overall, it's a well-performed study, however, causality between Plscr1 and Ifnlr1 expression needs to be more firmly established. This is because two recent studies of PLSCR1 KO cells infected with different viruses found no major differences in gene expression levels compared with their WT controls (Xu et al. Nature, 2023; LePen et al. PLoS Biol, 2024). There were also defects in the expression of other cytokines (type I and II IFNs plus TNF-alpha) so a clear explanation of why Ifnlr1 was chosen should also be given.

      While Plscr1 has long been recognized as a cell-intrinsic antiviral restriction factor, few studies have explored its broader physiological role. This study thus provides interesting insights into a specific function of Plscr1 in IAV-permissive airway epithelial cells and its contribution to whole body anti-viral immunity.

      Comments on revisions:

      Most of the requested changes and experiments have been done. One very informative experiment is the expression of Plscr1 in Ifnlr1-KO cells to determine if it still inhibits IAV infection. The authors have indicated that this experiment is currently being pursued by crossing mice to introduce Plscr1 expression into ciliated epithelial cells on an Ifnlr1 KO background. It will show if there are Ifnlr1-independent anti-flu activities that still require Plscr1.

    3. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public Review):

      Overall, it's a well-performed study; however, causality between Plscr1 and Ifnlr1 expression needs to be more firmly established. This is because two recent studies of PLSCR1 KO cells infected with different viruses found no major differences in gene expression levels compared with their WT controls (Xu et al. Nature, 2023; LePen et al. PLoS Biol, 2024). There were also defects in the expression of other cytokines (type I and II IFNs plus TNF-alpha), so a clear explanation of why Ifnlr1 was chosen should also be given.

      We appreciate the reviewer’s reference to the two recently published studies on PLSCR1’s role in SARS-CoV-2 infection. We have also discussed those studies in the Introduction and Discussion sections of this manuscript. Here, we would like to clarify our rationale for investigating Ifn-λr1 signaling.

      The reviewer mentioned “defects in the expression of other cytokines (type I and II IFNs plus TNF-alpha)” and requested a clearer explanation of why Ifnlr1 was chosen for study. In our investigation of IAV infection, we observed no defects in the expression of type I and II IFNs or TNF-α in Plscr1<sup>-/-</sup> mice; rather, these cytokines were expressed at even higher levels compared to WT controls (Figures 2D and 3A). This indicates that the type I and II IFN and TNF-α signaling pathways remain intact and are not negatively affected by the loss of Plscr1. Notably, Ifn-λr1 expression is the only one among all IFNs and their receptors that is significantly impaired in Plscr1<sup>-/-</sup> mice (Figure 3A), justifying our focused investigation of this receptor. To further clarify this point, we have expanded the explanation under the section titled “Plscr1 Binds to Ifn-λr1 Promoter and Activates Ifn-λr1 Transcription in IAV Infection” within the Results. The reviewer noted that previously published studies “found no major differences in gene expression levels compared with their WT controls”, but neither study examined Ifn-λr1 expression.

      (1) The authors propose that Plscr1 restricts IAV infection by regulating the type III IFN signaling pathway. While the data show a positive correlation between Ifnlr1 and Plscr1 levels in both mouse and cell culture models, additional evidence is needed to establish causality between the impaired type III IFN pathway, and the increased susceptibility observed in Plscr1-KO mice. To strengthen this conclusion, the following experiments could be undertaken: (i) Measure IAV titers in WT, Plscr1-KO, Ifnlr1-KO, and Plscr1/ Ifnlr1-double KO cells. If the antiviral activity of Plscr1 is highly dependent on Ifnlr1, there should be no further increase in IAV titers in double KO cells compared to single KO cells; (ii) over-express Plscr1 in Ifnlr1-KO cells to determine if it still inhibits IAV infection. If Plscr1's main action is to upregulate Ifnlr1, then it should not be able to rescue susceptibility since Ifnlr1 cannot be expressed in the KO background. If Plscr1 over-expression rescues viral susceptibility, then there are Ifnlr1-independent mechanisms involved. These experiments should help clarify the relative contribution of the type III IFN pathway to Plscr1-mediated antiviral immunity.

      We agree with the reviewer that additional evidence is necessary to establish causality between the impaired type III IFN pathway and the increased susceptibility observed in Plscr1-KO mice. As requested by the reviewer, and going one step further, we have measured IAV titers in Wt, Plscr1<sup>-/-</sup>, Ifn-λr1<sup>-/-</sup>, and Plscr1<sup>-/-</sup>Ifn-λr1<sup>-/-</sup> mouse lungs, which provided more comprehensive information at the tissue and organismal level than cell culture models would. Our results are detailed under “The Anti-Influenza Activity of Plscr1 Is Highly Dependent on Ifn-λr1” within the “Results” section and in Supplemental Figure 5. Importantly, there was no further increase in weight loss (Supplemental Figure 5B), total BAL cell counts (Supplemental Figure 5C), neutrophil percentages (Supplemental Figure 5D), or IAV titers (Supplemental Figure 5E) in Plscr1<sup>-/-</sup>Ifn-λr1<sup>-/-</sup> mouse lungs compared to Ifn-λr1<sup>-/-</sup> mouse lungs. These findings indicate that the antiviral activity of Plscr1 is largely dependent on Ifn-λr1.

      We agree that overexpression of Plscr1 on an Ifn-λr1<sup>-/-</sup> background would provide additional evidence to support our conclusion from the Plscr1<sup>-/-</sup>Ifn-λr1<sup>-/-</sup> mice. In future studies, we plan to specifically overexpress Plscr1 in ciliated epithelial cells on the Ifn-λr1<sup>-/-</sup> background by breeding Plscr1<sup>floxStop</sup>Foxj1-Cre<sup>+</sup>Ifn-λr1<sup>-/-</sup> mice. In addition, ciliated epithelial cells isolated from Ifn-λr1<sup>-/-</sup> murine airways could be transduced with a Plscr1 construct for overexpression. We hypothesize that overexpression of Plscr1 in ciliated epithelial cells will not rescue susceptibility in Ifn-λr1<sup>-/-</sup> mice or cells, since our Plscr1<sup>-/-</sup>Ifn-λr1<sup>-/-</sup> mouse model suggests that Ifn-λr1-independent anti-influenza functions of Plscr1 are likely minor compared to its role in upregulating Ifn-λr1. These future plans have been added to the “Discussion” section, and we look forward to presenting our results in a forthcoming publication.

      (3) In Figure 4, the authors demonstrate the interaction between Plscr1 and Ifnlr1. They suggest that this interaction modulates IFN-λ signaling. However, Figures 5C-E show that the 5CA mutant, which lacks surface localization and the ability to bind Ifnlr1, exhibits similar anti-flu activity to WT Plscr1. Does this mean the interaction between Plscr1 and Ifnlr1 is dispensable for Plscr1-mediated antiviral function? Can the authors compare the activation of IFN-λ signaling pathway in Plscr1-KO cells expressing empty vector, WT Plscr1, and 5CA mutant? This could be done by measuring downstream ISG expression or using an ISRE-luciferase reporter assay upon IFN-λ treatment.

      We agree with the reviewer that downstream activation of the IFN-λ signaling pathway is a critical component of the proposed regulatory role of PLSCR1. As suggested, we attempted to perform an ISRE-luciferase reporter assay following IFN-λ treatment in PLSCR1 rescue cell lines by transfecting the cells with hGAPDH-rLuc (Addgene #82479) and pGL4.45 [luc2P/ISRE/Hygro] (Promega #E4041).

      Despite extensive efforts over several months, we were unable to achieve expression of pGL4.45 [luc2P/ISRE/Hygro] in PLSCR1 rescue cells using either Lipofectamine 3000 or electroporation, as no firefly luciferase activity was detected at baseline or following IFN-λ treatment. In contrast, hGAPDH-rLuc was robustly expressed in these cells.

      The pGL4.45 [luc2P/ISRE/Hygro] plasmid was obtained directly from Promega as a purified product, and its sequence was confirmed via whole plasmid sequencing. Additionally, both hGAPDH-rLuc and pGL4.45 [luc2P/ISRE/Hygro] were successfully expressed in 293T cells, indicating that neither the plasmids nor the transfection protocols are inherently faulty.

      We suspect that prior modifications to the PLSCR1 rescue cells, such as CRISPR-mediated knockout and lentiviral transduction, may interfere with successful transfection of pGL4.45 [luc2P/ISRE/Hygro] through an as-yet-unknown mechanism. Although these results are disappointing, we will continue troubleshooting and plan to report the findings in a separate manuscript once the luciferase assay is successfully established.

      Reviewer #1 (Recommendations):

      (1) In the introduction, the linkage between the paragraph discussing type III IFN and PLSCR1 needs to be better established. The mention of PLSCR1 being an ISG at the outset may help connect these two paragraphs and make the text appear more logical.

      We apologize for the lack of linkage and logic between type 3 IFN and PLSCR1. We have introduced PLSCR1 as an ISG at the beginning of its paragraph as recommended. 

      (2) The statement that, “Intriguingly, PLSCR1 is also an antiviral ISG, as its expression can be highly induced by type 1 and 2 interferons in various viral infections[15, 16]. However, whether its expression can be similarly induced by type 3 interferon has not been studied yet.” is incorrect. Xu et al. tested the role of PLSCR1 in type III IFN-induced control of SARS-CoV-2 (ref. 24). This needs to be revised.

      We apologize for the incorrect information in the introduction and have revised the paragraph with the proper citation.

      (3) In Figure 3B, can the authors provide a comprehensive heatmap that includes all ISGs above the threshold, rather than only a subset? This would offer a more complete overview of the changes in type I, II, and III IFN pathways in Plscr1-KO mice.

      As suggested by the reviewer, we have provided a comprehensive heatmap that includes all ISGs above the threshold in Figure 3C (previously Figure 3B). We identified a total of 1,113 ISGs in our dataset with a fold change ≥2. Enlarged heatmaps with gene names are provided in Supplemental Figure 1. Among those ISGs, 584 are regulated exclusively by type 1 IFNs, and 488 are regulated by both type 1 and type 2 interferons. Unfortunately, the Interferome database does not include information on type 3 IFN-inducible genes in mice[1]. Although many ISGs were robustly upregulated in Plscr1<sup>-/-</sup> infected lungs, consistent with inflammation data, a large subset of ISGs failed to be transcribed when Ifn-λr1 function was impaired, especially at 7 dpi. We suspect that those non-transcribed ISGs in Plscr1<sup>-/-</sup> mice may be specifically regulated by type 3 IFN and represent interesting targets for future research. These results have been added to “Plscr1 Binds to Ifn-λr1 Promoter and Activates Ifn-λr1 Transcription in IAV Infection” within the “Results” section.

      (4) In Figure 3C, 5B and 7H, immunoblots should also be included to measure changes of Ifnlr1/IFNLR1 protein level.

      As requested by the reviewer, we have provided western blots measuring Ifn-λr1/IFN-λR1 protein levels in Figures 5B and 7I. Protein expression was consistent with the PCR results.

      (5) In Figure 3H, the amount of RPL30 is also low in the anti-PLSCR1-treated and IgG samples, making it difficult to estimate if ChIP binding is genuinely impacted.

      RPL30 Exon 3 serves as a negative control locus in the ChIP experiment and is not expected to be enriched in either the anti-PLSCR1-treated or the IgG control samples. Anti-Histone H3 treatment is a positive control, and the treated sample is expected to show enrichment of RPL30 Exon 3. We hope this clarification resolves any confusion.

      (6) In Figure 4A, can the authors show a larger slice of the gel with molecular weight markers for both Plscr1 and Ifnlr1. In the coIP, the binding may be indirect through intermediate partners. Proximity ligation assay is a more direct assay for interaction and can be stated as such.

      As suggested by the reviewer, we have included whole gel images of Figure 4A with molecular weight markers for both Plscr1 and Ifnlr1 in Supplemental Figure 3. We appreciate the reviewer’s endorsement of the proximity ligation assay and have described it as a more direct assay for interaction under “Plscr1 Interacts with Ifn-λr1 on Pulmonary Epithelial Cell Membrane in IAV Infection” in the “Results” section.

      (7) In Figure 5A, how can the expression of PLSCR1 WT and mutants driven by an EF-1α promoter be further upregulated by IAV infection? Can the authors also use immunoblots to examine the protein level of PLSCR1?

      We apologize for the confusion and appreciate the reviewer’s careful observation. We were initially surprised by this finding as well, but upon further investigation, we found that the human PLSCR1 primers used in our qRT-PCR assay can still detect transcription from the undisturbed portion of the endogenous PLSCR1 mRNA, even in PLSCR1<sup>-/-</sup> cells. In the original Figure 5A, data for vector-transduced PLSCR1<sup>-/-</sup> cells were not included because PCR was not performed on those samples at the time. After conducting PCR for vector-transduced PLSCR1<sup>-/-</sup> cells, we detected transcription of PLSCR1, which confirms that the signal originates from endogenous DNA and not from the EF-1α promoter-driven PLSCR1 plasmid. Please see Author response image 1 below.

      Author response image 1.

      The forward human PLSCR1 primer we used matches 15-34 nt of Wt PLSCR1, and the reverse primer matches 224-244 nt of Wt PLSCR1. CRISPR-Cas9 KO of PLSCR1 was mediated by sgRNAs in A549 cells and was performed by Xu et al[2]. sgRNA #1 matches 227-246 nt, sgRNA #2 matches 209-228 nt, and sgRNA #3 matches 689-708 nt of Wt PLSCR1. The sgRNAs likely introduced a short deletion or insertion that does not affect transcription. However, those endogenous mRNA transcripts cannot be translated into functional and detectable PLSCR1 proteins, as validated by our western blot (below), as well as western blots performed by Xu et al[2]. Therefore, our primers could amplify endogenous PLSCR1 transcripts upregulated by IAV infection, if 15-244 nt was not disturbed by CRISPR-Cas9 KO. By western blot, we confirmed that only endogenous PLSCR1 expression is upregulated by IAV infection, and that exogenous protein expression from PLSCR1 plasmids driven by an EF-1α promoter is not upregulated by IAV infection.
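
      The coordinate argument above can be checked with simple interval arithmetic. The following is a minimal sketch and not part of the original response: it takes the nucleotide ranges quoted above, assumes they are 1-based and inclusive, and reports which sgRNA target sites fall inside the region spanned by the primer pair. All function and variable names are illustrative.

```python
# Nucleotide ranges as quoted in the response.
# Assumption: coordinates are 1-based and inclusive on both ends.
FORWARD_PRIMER = (15, 34)
REVERSE_PRIMER = (224, 244)
SGRNAS = {
    "sgRNA #1": (227, 246),
    "sgRNA #2": (209, 228),
    "sgRNA #3": (689, 708),
}

# The amplicon runs from the start of the forward primer
# to the end of the reverse primer.
AMPLICON = (FORWARD_PRIMER[0], REVERSE_PRIMER[1])


def overlaps(a, b):
    """Return True if two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def summarize():
    """For each sgRNA, report overlap with the amplicon and the reverse primer."""
    report = {}
    for name, site in SGRNAS.items():
        report[name] = {
            "in_amplicon": overlaps(site, AMPLICON),
            "touches_reverse_primer": overlaps(site, REVERSE_PRIMER),
        }
    return report


if __name__ == "__main__":
    for name, flags in summarize().items():
        print(name, flags)
```

      Under these assumptions, the target sites of sgRNA #1 and sgRNA #2 overlap the reverse-primer region, while sgRNA #3 lies outside the amplicon. This is consistent with the caveat above that amplification depends on the 15-244 nt region remaining undisturbed by the edit.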

      Author response image 2.

      To avoid confusion, we have removed the original Figure 5A from the manuscript.

      (8) In Figure 5C, the loss of anti-flu activity with the H262Y mutant is modest, suggesting the loss of Ifnlr1 transcription is only partly responsible for the susceptibility of Plscr1 KO cells. The anti-flu activity being independent of scramblase activity resembles the earlier discovery for SARS-CoV-2 (Xu et al., 2023). This could be stated in the results since it is an important point that scramblase activity is dispensable for several major human viruses and shifts the emphasis regarding mechanism. It has been appropriately noted in the discussion.

      We appreciate the comments and have acknowledged the consistency of our results with those of Xu et al. under “Both Cell Surface and Nuclear PLSCR1 Regulates IFN-λ Signaling and Limits IAV Infection Independent of Its Enzymatic Activity” in the “Results” section.

      Reviewer #2 (Recommendations):

      (1) The statement that type I interferons are expressed by “almost all cells” is inaccurate (line 61). Type I IFN production is also context-dependent and often restricted to specific cell types upon infection or stimulation.

      We apologize for the inaccurate description of the expression pattern of type 1 IFNs and have revised the “Introduction” to reflect their restricted cellular sources.

      (2) The antiviral response is assessed solely through flu M gene expression. Incorporating infectious virus titers (e.g., TCID50 or plaque assay) would provide a more robust and direct measure of antiviral activity.

      As requested by the reviewer, we have performed plaque assays for all experiments in which flu M gene expression levels were measured (Figures 1G, 5E and 7F, and Supplemental Figure 6E). The plaque assay results are consistent with the flu M gene expression levels.

      (3) While mRNA expression of interferons is measured, protein levels (e.g., through ELISA) should also be quantified to establish the functional relevance of IFN expression changes.

      As requested by the reviewer, we have quantified the protein level of IFN-λ in mouse BAL by ELISA (Figure 2E). The ELISA results are consistent with the mRNA expression of IFN-λ.

      (4) It is unclear whether reduced IFNLR1 expression translates to defective downstream signaling or antiviral responses after IFN-λ treatment in PLSCR1-deficient cells. This is particularly pertinent given the increase in IFN-λ ligand in vivo, which might compensate for receptor downregulation.

      We agree with the reviewer that downstream activation of the IFN-λ signaling pathway is a critical aspect of PLSCR1’s proposed regulatory role. To investigate this, we attempted an ISRE-luciferase reporter assay to assess downstream signaling following IFN-λ treatment in PLSCR1 rescue cells. Unfortunately, the experiment encountered unforeseen technical issues. For additional context, please refer to our response to Reviewer #1’s public review #3.

      (5) Detailed gating strategies for immune cell subsets are absent and should be included for clarity and reproducibility.

      We would like to clarify that the immune cell subsets in BAL fluids were counted manually following cytospin preparation and Diff-Quik staining (Figure 2B and 7H, and Supplemental Figures 2C, 5D, and 8D), rather than by flow cytometry. We hope this resolves the reviewer’s confusion.

      (6) The study does not definitively establish that reduced IFN-λ signaling causes the observed in vivo phenotype. Increased morbidity and mortality in PLSCR1-deficient mice could also stem from elevated TNF-α levels and lung damage, as proinflammatory cytokines and/or enhanced lung damage are known contributors to influenza morbidity and mortality. This point warrants detailed discussions.

      We agree with the reviewer that this study does not definitively establish causality between reduced IFN-λ signaling and the increased morbidity of Plscr1<sup>-/-</sup> mice, and that more experiments are needed to reach this conclusion. We have acknowledged this limitation of our study in the “Discussion”, as requested by the reviewer. We hope to eliminate the confounding factors and definitively establish the proposed causality in future studies.

      Reviewer #3 (Public review):

      Summary:

      Yang et al. have investigated the role of PLSCR1, an antiviral interferon-stimulated gene (ISG), in host protection against IAV infection. Although some antiviral effects of PLSCR1 have been described, its full activity remains incompletely understood.

      This study now shows that Plscr1 expression is induced by IAV infection in the respiratory epithelium, and Plscr1 acts to increase Ifn-λr1 expression and enhance IFN-λ signaling possibly through protein-protein interactions on the cell membrane.

      Strengths:

      The study sheds light on the way Ifnlr1 expression is regulated, an area of research where little is known. The study is extensive and well-performed with relevant genetically modified mouse models and tools.

      Weaknesses:

      There are some issues that need to be clarified/corrected in the results and figures as presented.

      Also, the study does not provide much information about the role of PLSCR1 in the regulation of Ifn-λr1 expression and function in immune cells. This would have been a plus.

      We would like to thank the reviewer for the positive feedback and insightful comment regarding the roles of PLSCR1 and IFN-λR1 in immune cells. It is important to note that IFN-λR1 expression is highly restricted in immune cells and is primarily limited to neutrophils and dendritic cells[3]. While dendritic cells were not the focus of this study, we did examine all immune cell subsets in our single cell RNA seq data and performed infection experiments in Plscr1<sup>floxStop</sup>/LysM-Cre<sup>+</sup> mice. We have not observed any significant findings in these populations. On the other hand, we do have some interesting preliminary data suggesting a role for PLSCR1 in regulating Ifn-λr1 expression and function in neutrophils. These findings are discussed in detail in our response to reviewer #3’s recommendation #12.

      Reviewer #3 (Recommendations):

      (1) In Figure 1B, the Plscr1 label should be moved to the y-axis so that readers don't confuse it with the Plscr1-/- mice used in the other figure panels. The fact that WT mice were used should be added in the figure legend.

      We apologize for the confusion in the figures. We have moved the Plscr1 label to the y-axis in Figure 1B and have noted in the figure legend that Wt mice were used.

      (2) In Figure 1C and D, the type of dose leading to the presented data should be added to help the reader. Also, shouldn't statistics be added?

      We appreciate the suggestion and have added doses to Figures 1C and 1D. We are unsure which additional statistics the reviewer is requesting, as two-way ANOVA tests were used to compare weight loss and the significance is labeled on the figures.

      (3) In Figures 1, F, and G, it is not indicated whether sublethal or lethal dose was used for the IAV infection. This should be very clear in the figure and figure legend.

      We apologize for the confusion regarding the infection doses used in the figures. We have added doses to Figures 1F, 1G and 1H.

      (4) In Figure 1, the CTCF abbreviation should be explained in the Figure legend.

      We have explained CTCF in the figure legend as requested.

      (5) In Figure 2B, this is percentages of what?

      Figure 2B shows the percentages of each immune cell type within total BAL cells.

      (6) In Figures 3A and B, transcriptomes for each condition are from how many mice? Also, what do heatmaps show? Fold induction, differences, etc, and from what? What is compared with what? In addition, is there a discordance between the RNAseq data of Figure 3A and the qPCR data of Fig. 3C in terms of Ifnlr1 expression?

      In Figures 3A and 3C (previously 3B), RNA from the whole lungs of 9 mice per PBS-treated group and 4 mice per IAV-infected group was pooled for transcriptomic analysis. Figure 3A represents a heatmap of differential gene expression, while Figure 3C (previously 3B) represents fold changes in gene expression relative to uninfected controls. In both heatmaps, gene expression values are color-coded from row minimum (blue) to row maximum (red), enabling comparison across groups within each gene (row). The major comparison of interest in these heatmaps is between Wt infected mice and Plscr1<sup>-/-</sup> infected mice. We have added this information to the figure legend.
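
      To illustrate what row-wise color coding implies, the sketch below rescales each gene (row) independently to the range 0 to 1, so that 0 corresponds to the row minimum (blue) and 1 to the row maximum (red). It is an illustration under stated assumptions, not the authors' plotting code: the function name and the expression values are hypothetical, and the actual heatmap software may scale differently.

```python
def scale_rows(matrix):
    """Rescale each row to [0, 1] using that row's own minimum and maximum.

    A constant row (max == min) is mapped to all zeros to avoid division by zero.
    """
    scaled = []
    for row in matrix:
        lo, hi = min(row), max(row)
        span = hi - lo
        scaled.append([0.0 if span == 0 else (v - lo) / span for v in row])
    return scaled


if __name__ == "__main__":
    # Hypothetical expression values; columns stand for experimental groups.
    expression = [
        [1.0, 2.0, 3.0],        # low-abundance gene
        [100.0, 150.0, 200.0],  # high-abundance gene
    ]
    for row in scale_rows(expression):
        print(row)
```

      Because each row is scaled on its own, the two hypothetical genes above receive identical colors despite a 100-fold difference in abundance. Colors in such a heatmap therefore support comparisons across groups within a gene, as stated above, but not comparisons between genes.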

      We also acknowledge the reviewer’s observation regarding the discordance between the RNA seq data of Figure 3A and the qPCR data of Figure 3B (previously 3C) for Ifnlr1 expression. To address this, we have repeated the qRT-PCR experiment with additional samples at 7 dpi. In the updated results, Wt mice consistently show significantly higher Ifn-λr1 expression than Plscr1<sup>-/-</sup> infected mice at both 3 dpi and 7 dpi, consistent with the RNA seq data. However, a time-dependent discrepancy between the RNA-seq and qRT-PCR datasets remains: Ifn-λr1 expression continues to increase at 7 dpi in the RNA-seq data (Figure 3A), whereas it declines in the qRT-PCR results (Figure 3B). The reason for this discrepancy remains unclear and has been addressed in the Discussion section.

      (7) In Figure 3D, have the authors checked whether the Ifnlr1 antibody they use is indeed specific for Ifnlr1? Have they used any blocking peptide for the anti-mouse Ifn-λr1 polyclonal antibody they are using? Also, in Figure 3E, the marker used for staining should be indicated in the pictures of the lung section.

      Unfortunately, a blocking peptide is not available for the anti-mouse Ifn-λr1 polyclonal antibody used in our study. To assess antibody specificity, we have performed immunofluorescence staining of Ifn-λr1 on lung tissues from Ifn-λr1<sup>-/-</sup> mice using the same antibody. No signal was detected (Supplemental Figure 5A), supporting the specificity of the antibody for Ifn-λr1.

      As requested by the reviewer, we have added the marker (Ifn-λr1) to the pictures of the lung section in Figure 3E.

      (8) In Figure 5, it's better to move each graph's label that stands to the top (e.g. PLSCR1, IFN-λR1 etc) to the y-axis label so that it doesn't get confused with the mouse -/- label.

      We apologize for the confusion and have moved the top label to the y-axis in Figure 5.

      (9) In Figure 6A, it is claimed that the 'two-dimensional UMAP demonstrated that these main lung cell populations (epithelial, endothelial, mesenchymal, and immune) were dynamic over the course of infection.'. This is not clear by the data. The percentage of cells per cluster should be calculated.

      As requested by the reviewer, the proportion (Supplemental Figure 6A) and cell count (Supplemental Figure 6B) of each cluster have been calculated and included in “PLSCR1 Expression Is Upregulated in the Ciliated Airway Epithelial Compartment of Mice following Flu Infection” under the “Results” section. Together with the two-dimensional UMAP (Figure 6A), these data demonstrate that the main lung cell populations (epithelial, endothelial, mesenchymal, and immune) were dynamic over the course of infection. Following infection, many populations emerged, particularly within the immune cell clusters. At the same time, some clusters, such as microvascular endothelial cells (cluster 2), were initially depleted and later restored. Other populations, such as interferon-responsive fibroblasts (cluster 20), showed a dramatic yet transient expansion during acute infection and disappeared after the infection resolved.

      (10) In Figure 6 B and C, the legend should indicate that these are Violin plots. Also, if AT2 cells don't express Plscr1, does that indicate that in these cells Plscr1 is not needed for IFN-λR1 expression?

      As requested, we have indicated in the legend of Figure 6B and 6C that these are violin plots. Plscr1 is expressed at low levels in AT2 cells. However, it is unclear whether Plscr1 is needed for Ifn-λr1 expression in AT2 cells, and it would be interesting to investigate further.

      (11) In lines 302-304, it is stated that 'Among the various epithelial populations, ciliated epithelial cells not only had the highest aggregated expression of Plscr1, but also were the only epithelial cell population in which significantly more Plscr1 was induced in response to IAV infection.'. Which data/figure support this statement?

      Figure 6B shows that among the various epithelial populations, ciliated epithelial cells had the highest aggregated expression of Plscr1. To better illustrate this statement, we have rearranged the order of cell clusters from highest to lowest Plscr1 expression, and added red dots to indicate the mean expression levels for each cluster in Figure 6B.

      Ciliated epithelial cells also had the most significant increase in Plscr1 expression (p < 2.22e-16 and p = 6.7e-05) in early IAV infection at 3 dpi (Figure 6C and Supplemental Figure 7A-7K). In comparison, AT1 cells were the only other epithelial cluster to show Plscr1 upregulation at 3 dpi, but to a much lesser extent (p = 0.033, Supplemental Figure 7J). Supplemental Figure 7 was added to better support the statement, and the explanation was added to “PLSCR1 Expression Is Upregulated in the Ciliated Airway Epithelial Compartment of Mice following Flu Infection” under the “Results” section.

      (12) As earlier, if Plscr1 is not expressed in neutrophils (Figure 6F), does that mean IFN-λR1 expression does not require Plscr1 in these cells?

      Although Plscr1 is expressed at lower levels in neutrophils compared to epithelial cells, it is still detectable. In fact, our preliminary data suggest that IFN-λR1 expression in neutrophils is dependent on Plscr1. We have isolated neutrophils from peripheral blood and BAL of IAV-infected Wt and Plscr1<sup>-/-</sup> mice using a mouse neutrophil enrichment kit. Quantitative PCR results showed that Plscr1<sup>-/-</sup> neutrophils exhibit significantly lower expression of Ifn-λr1, alongside elevated levels of Il-1β, Il-6 and Tnf-α in IAV infection (see figures below). These findings suggest that Plscr1 may play an anti-inflammatory role in neutrophils by upregulating Ifn-λr1. These data were not included in the current manuscript because they are beyond the scope of the current study, but we hope to address the role of PLSCR1 in regulating IFN-λR1 expression and function in neutrophils in a future study.

      Author response image 3.

      (13) The Figure 7A legend is not well stated. Something like ' Schematic representation of the experimental design of...' should be included. Also, Figure 7J is not referenced in the text.

      We apologize for the unclear Figure 7A legend and have changed it to “Schematic representation of the experimental design of ciliated epithelial cell conditional Plscr1 KI mice.” Figure 8 (previously Figure 7J) has now been referenced in the text.

      (14) In the Methods, more specific information in some parts should be provided. For example, the clones of the antibodies used should be included.

      Apart from the 10x technology, the kits used and the type of the Illumina sequencing should be provided. Information on how the QC was performed (threshold for reads/cell, detected genes/per cells, and % of mitochondrial genes etc) should be added.

      We apologize for the missing information in the “Methods”. We have now provided the clones of the antibodies used, the kit used to generate single-cell transcriptomic libraries, the type of the Illumina sequencing, and the QC performance data.

      References

      (1) Rusinova, I., et al., Interferome v2.0: an updated database of annotated interferon-regulated genes. Nucleic Acids Res, 2013. 41(Database issue): p. D1040-6.

      (2) Xu, D., et al., PLSCR1 is a cell-autonomous defence factor against SARS-CoV-2 infection. Nature, 2023. 619(7971): p. 819-827.

      (3) Donnelly, R.P., et al., The expanded family of class II cytokines that share the IL-10 receptor-2 (IL-10R2) chain. J Leukoc Biol, 2004. 76(2): p. 314-21.

    1. eLife Assessment

      This important study combines a comprehensive range of biophysical, kinetic, and thermodynamic techniques, together with high-quality experimental and computational analysis, to carry out a series of well-designed experiments to explore whether glutamine-binding protein binds glutamine via an induced fit or a conformational selection process. The evidence supporting the major conclusion of the work is compelling. The work will be of broad interest to biochemists and biophysicists.

    2. Reviewer #1 (Public review):

      Here the authors discuss mechanisms of ligand binding and conformational changes in GlnBP (a small E. coli periplasmic binding protein, which binds and carries L-glutamine to the inner membrane ATP-binding cassette (ABC) transporter). The authors have distinguished records in this area and have published seminal works. They include experimentalists and computational scientists. Accordingly, they provide comprehensive, high-quality experimental and computational work.

      They observe that apo- and holo-GlnBP do not generate detectable exchange between open and (semi-) closed conformations on timescales between 100 ns and 10 ms. Notably, the ligand binding and conformational changes in GlnBP that they observe are highly correlated. Their analysis of the results indicates a dominant induced-fit mechanism, where the ligand binds GlnBP prior to conformational rearrangements. They then suggest that an approach resembling the one they undertook can be applied to study the coupling mechanism of conformational changes and ligand binding in other protein systems.

      They argue that the intuitive model where ligand binding triggers a functionally relevant conformational change was challenged by structural experiments and MD simulations revealing the existence of unliganded closed or semi-closed states and their dynamic exchange with open unbound conformations. They discuss the alternative mechanisms that were proposed, along with their merits and difficulties, and conclude that the findings were controversial, which, they suggest, is due to insufficient experimental evidence to distinguish them. As to further specific conclusions they draw from their results, they determine that a conformational selection mechanism is incompatible with their results, but induced fit is. They thus propose induced fit as the dominant pathway for GlnBP, further supported by the notion that the open conformation is much more likely to bind substrate than the closed one based on steric arguments.

      The paper here, which clearly embodies massive, careful, and high-quality work, is extensive, making use of a range of experimental approaches, including isothermal titration calorimetry, single-molecule Förster resonance energy transfer, and surface-plasmon resonance spectroscopy. The problem the authors undertake is of fundamental importance.

    3. Reviewer #2 (Public review):

      The authors provide convincing data from a whole set of different binding kinetic and thermodynamic experiments to explore whether glutamine binding protein binds glutamine via an induced fit or a conformational selection process.

      Weaknesses:

      The single-molecule TIRF-smFRET data appear to include spots that may represent more than one molecule, which raises the general issue of how rigorously traces were selected for single photobleaching events.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Here the authors discuss mechanisms of ligand binding and conformational changes in GlnBP (a small E. coli periplasmic binding protein, which binds and carries L-glutamine to the inner membrane ATP-binding cassette (ABC) transporter). The authors have distinguished records in this area and have published seminal works. They include experimentalists and computational scientists. Accordingly, they provide comprehensive, high-quality experimental and computational work.

      They observe that apo- and holo-GlnBP do not generate detectable exchange between open and (semi-) closed conformations on timescales between 100 ns and 10 ms. Especially, the ligand binding and conformational changes in GlnBP that they observe are highly correlated. Their analysis of the results indicates a dominant induced-fit mechanism, where the ligand binds GlnBP prior to conformational rearrangements. They then suggest that an approach resembling the one they undertook can be applied to other protein systems where the coupling mechanism of conformational changes and ligand binding remains unresolved.

      They argue that the intuitive model where ligand binding triggers a functionally relevant conformational change was challenged by structural experiments and MD simulations revealing the existence of unliganded closed or semi-closed states and their dynamic exchange with open unbound conformations. They discuss the alternative mechanisms that were proposed, along with their merits and difficulties, and conclude that the findings were controversial, which, they suggest, is due to insufficient experimental evidence to distinguish them. As to further specific conclusions they draw from their results, they determine that a conformational selection mechanism is incompatible with their results, but induced fit is. They thus propose induced fit as the dominant pathway for GlnBP, further supported by the notion that the open conformation is much more likely to bind substrate than the closed one based on steric arguments.

      Considering the landscape of substrate-free states, in my view, the closed state is likely to be the most stable and thus most highly populated. As the authors note, and I agree, that state can be sterically infeasible for a deep-pocketed substrate. As indeed they also underscore, there is likely to be a range of open states. If the populations of certain states are extremely low, they may not be detected by the experimental (or computational) methods. The free energy landscape of the protein can populate all possible states, with the populations determined by their relative energies. In principle, the protein can visit all states. Whether a particular state is observed depends on the time the protein spends in that state. The frequencies, or propensities, of the visits can determine the protein function. As to a specific order of events, in my view, there isn't any. It is a matter of probabilities, which depend on the populations (energies) of the states. The open conformation that is likely to bind is the most favorable, permitting substrate access, followed by minor, induced fit conformational changes. However, a key factor is the ligand concentration. Ligand binding requires overcoming barriers to sustain the equilibrium of the unliganded ensemble, thus time. If the population of the state is low, and ligand concentration is high (often the case in in vitro experiments, and high drug dosage scenarios), binding is likely to take place across a range of available states. This is, however, a personal interpretation of the data.

      The paper here, which clearly embodies massive, careful, and high-quality work, is extensive, making use of a range of experimental approaches, including isothermal titration calorimetry, single-molecule Förster resonance energy transfer, and surface-plasmon resonance spectroscopy. The problem the authors undertake is of fundamental importance.

      Reviewer #2 (Public Review):

      The manuscript by Han et al and Cordes is a tour-de-force effort to distinguish between induced fit and conformational selection in glutamine binding protein (GlnBP). 

      We thank the referee for the recognition of the work and effort that has gone into this manuscript. 

      It is important to say that I don't agree that a decision needs to be made between these two limiting possibilities in the sense that whether a minor population can be observed depends on the experiment and the energy difference between the states. That said, the authors make an important distinction, which is that it is not sufficient to observe both states in the ligand-free solution because it is likely that the ligand will not bind to the already closed state. The ligand binds to the open state and the question then is whether the ligand sufficiently changes the energy of the open state to effectively cause it to close. The authors point out that this question requires both a kinetic and a thermodynamic answer. Their "method" combines isothermal titration calorimetry, single-molecule FRET including key results from multi-parameter photon-by-photon hidden Markov modelling (mpH2MM), and SPR. The authors present this "method" of combination of experiments as an approach to definitively differentiate between induced fit and conformational selection. I applaud the rigor with which they perform all of the experiments and agree that others who want to understand the exact mechanism of protein conformational changes connected to ligand binding need to do such a multitude of different experiments to fully characterize the process. However, the situation of GlnBP is somewhat unique in the high affinity of the Gln (slow off-rate) as compared to many small molecule binding situations such as enzyme-substrate complexes. It is therefore not surprising that the kinetics result in an induced fit situation.

      For us these comments are an essential part of the conceptual aspects of our work and the resulting research. From a descriptive viewpoint, it is essential for us (and we tried to further highlight and stress this in the updated version of our paper) that IF and CS are two kinetic mechanisms of ligand binding. They imply – if active in a biomolecular system – a temporal order and timescale separation of ligand binding and conformational changes. Since we found many conflicting results for the binding mechanism of GlnBP, but also other SPBs, we decided to assess the situation in GlnBP. 
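
      As a reading aid, the two limiting mechanisms named in the response above can be written as minimal two-step kinetic schemes. This is only a sketch: the symbols (P_open, P_closed, L) and the reduction to two steps are notational assumptions introduced here, not taken from the manuscript.

```latex
% Conformational selection (CS): the conformational change precedes ligand binding
\[
\mathrm{P_{open}} \rightleftharpoons \mathrm{P_{closed}},
\qquad
\mathrm{P_{closed}} + \mathrm{L} \rightleftharpoons \mathrm{P_{closed}\cdot L}
\]

% Induced fit (IF): ligand binding precedes the conformational change
\[
\mathrm{P_{open}} + \mathrm{L} \rightleftharpoons \mathrm{P_{open}\cdot L}
\rightleftharpoons \mathrm{P_{closed}\cdot L}
\]
```

      Written this way, the "temporal order" in the response is simply which of the two steps comes first along the pathway to the closed, ligand-bound state.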

      In the case of the E-S complexes I am familiar with, the dissociation is much more rapid because the substrate binding affinity is in the micromolar range and therefore the re-equilibration of the apo state is much faster. In this case, the rate of closing and opening doesn't change much whether ligand is present or not. Here, of course, once the ligand is bound the re-equilibration is slow. Therefore, I am not sure if the conclusions based on this single protein are transferrable to most other protein-small molecule systems. 

      We do not argue that our results and interpretations are valid for most other protein-ligand systems, be they enzymes or simple ligand binders. Yet, based on the conservation of ABC-related SBPs and the fact that quite a few of them show sub-µM Kds, we consider it likely that many situations analogous to that of GlnBP exist, a view also supported by our previous results, e.g., from de Boer et al., eLife (2019).
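
      The affinity argument in this exchange can be made concrete with a back-of-the-envelope sketch. It assumes a simple one-step binding model in which Kd = k_off / k_on, takes the association rate of roughly 10^8 M^-1 s^-1 quoted in the review for the SPR data, and uses two illustrative Kd values (100 nM and 10 µM) that are assumptions rather than measurements from the manuscript. The function names are hypothetical.

```python
def off_rate(kd_molar, k_on):
    """Dissociation rate constant (s^-1), assuming one-step binding: k_off = Kd * k_on."""
    return kd_molar * k_on


def mean_bound_lifetime(kd_molar, k_on):
    """Mean lifetime of the ligand-bound state (s), i.e. 1 / k_off."""
    return 1.0 / off_rate(kd_molar, k_on)


K_ON = 1e8  # M^-1 s^-1, order of magnitude quoted for the SPR measurements

# Illustrative affinities (assumed values, not taken from the manuscript)
examples = [
    ("sub-micromolar binder, Kd = 100 nM", 1e-7),
    ("micromolar binder, Kd = 10 uM", 1e-5),
]

for label, kd in examples:
    print(f"{label}: k_off ~ {off_rate(kd, K_ON):.3g} s^-1, "
          f"bound lifetime ~ {mean_bound_lifetime(kd, K_ON):.3g} s")
```

      Under these assumptions, the tighter binder stays bound about 100 times longer (on the order of 0.1 s versus 1 ms), which is the difference in re-equilibration speed that the reviewer contrasts with typical enzyme-substrate complexes.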

      I am also not sure if they are transferrable to protein-protein systems where both molecules, the ligand and the receptor, are expected to have multiscale dynamics that change upon binding.

      As we argue above the two mechanisms IF/CS imply a clear temporal order and separation of timescales for ligand binding and conformational changes. These mechanisms are simple and extreme cases that we tested before more complex kinetic schemes are inferred for the description of ligand binding and conformational changes (which might not be necessary). 

      Strengths:

      The authors provide beautiful ITC data and smFRET data to explore the conformational changes that occur upon Gln binding. Figure 3D and Figure 4 (multi-parameter photon-by-photon hidden Markov modelling, mpH2MM, data) provide the really critical data. In the presence of glutamine concentrations near the Kd, two FRET-active sub-populations are identified that appear to interconvert on timescales slower than 10 ms. They then do a whole bunch of control experiments to look for faster dynamics (Figure 5). They also do TIRF smFRET to try to compare their results to those of previous publications. Here, they find that several artifacts occur, including inactivation of ~50% of the proteins. They also perform SPR experiments to measure the association rate of Gln and obtain expectedly rapid association rates on the order of 10<sup>8</sup> M<sup>-1</sup>s<sup>-1</sup>.

      Thank you.  

      Weaknesses:

      Looking at the traces presented in the supplementary figures, one can see that several of the traces have more than one molecule present. The authors should make sure that they use only traces with a single photobleaching event for each fluorophore. One can see steps in some of the green traces that indicate two green fluorophores (likely from 2 different molecules) in the traces. This is one of the frequent problems with TIRF smFRET with proteins, that only some of the spots represent single molecules and the rest need to be filtered out of the analysis.

      We have inspected all TIRF data provided with the manuscript and assume that the referee refers to the data shown in the current Appendix Figure 4/5. We agree that those traces in which no photobleaching occurs could potentially be questioned, yet they would not change our interpretations, and we thus decided to leave the figure as is.

      The NMR experiments that the authors cite are not in disagreement with the work presented here. NMR is capable of detecting "invisible states" that occur in 1-5% of the population. SmFRET is not capable of detecting these very minor states. I am quite sure that if NMR spectroscopists could add very high concentrations of Gln they would also see a conversion to the closed population.

      We agree with the referee that NMR is capable of detecting invisible states that occur in 1-5% of the population (see e.g., the paper cited in our manuscript by Tang, C et al., Open-to-closed transition in apo maltose-binding protein observed by paramagnetic NMR. Nature 2007, 449, 1078). Yet, we see a strong disagreement between our work and papers on GlnBP, where a combination of NMR, FRET and MD was used (Feng, Y. et al., Conformational Dynamics of apo‐GlnBP Revealed by Experimental and Computational Analysis. Angewandte Chemie 2016, 55, 13990; Zhang, L. et al., Ligand-bound glutamine binding protein assumes multiple metastable binding sites with different binding affinities. Communications biology 2020, 3, 1). These inconsistencies were also noted by others in the field (Kooshapur, H. et al., NMR Analysis of Apo Glutamine‐Binding Protein Exposes Challenges in the Study of Interdomain Dynamics. Angewandte Chemie 2019, 58, 16899) and we reemphasize that this latest NMR publication comes to similar conclusions as we present in our manuscript.   

      Reviewer #1 (Recommendations For The Authors):

      The paper embodies massive, careful, and high-quality work, and is extensive, making use of a range of experimental approaches, including isothermal titration calorimetry, single-molecule Förster resonance energy transfer, and surface-plasmon resonance spectroscopy. Considering this extensiveness, I do not see what more the authors can do.

      We very much appreciate the assessment and positive comments of the referee, but still tried to incorporate simulation data to support our interpretations.

      Reviewer #2 (Recommendations For The Authors):

      (1) Looking at the traces presented in the supplementary figures, one can see that several of the traces have more than one molecule present. The authors should make sure that they use only traces with a single photobleaching event for each fluorophore. One can see steps in some of the green traces that indicate two green fluorophores (likely from 2 different molecules) in the traces. This is one of the frequent problems with TIRF smFRET with proteins, that only some of the spots represent single molecules and the rest need to be filtered out of the analysis.

      See response above for iteration of TIRF data selection and analysis.

      (2) The NMR experiments that the authors cite are not in disagreement with the work presented here. NMR is capable of detecting "invisible states" that occur in 1-5% of the population. SmFRET is not capable of detecting these very minor states. I am quite sure that if NMR spectroscopists could add very high concentrations of Gln they would also see a conversion to the closed population.

      See response above.

      Minor point:

      (1) It is difficult to see what is going on between apo and holo in Figure 1B. Could the authors make Figure 1a, 1b apo, and 1b holo in the same orientation (by aligning D2 or D1 to each other in all figures) so one can see which helices are in the same place and which have moved?

      We respectfully disagree and decided to keep this figure as it is.

    1. eLife Assessment

      This study presents an important finding linking the bacterial metabolite trimethylamine and its receptor to circadian rhythms and olfaction. The current evidence supporting the claims of the authors is compelling. This work will be of broad interest to researchers interested in nutrition, microbial metabolism, circadian rhythms, and host-microbiome interactions.

    2. Reviewer #1 (Public review):

      Summary:

      This study focuses on the bacterial metabolite TMA, generated from dietary choline. These authors and others have previously generated foundational knowledge about the TMA metabolite TMAO, and its role in metabolic disease. This study extends those findings to test whether TMAO's precursor, TMA, and its receptor TAAR5 are also involved and necessary for some of these metabolic phenotypes. They find that mice lacking the host TMA receptor (Taar5-/-) have altered circadian rhythms in gene expression, metabolic hormones, gut microbiome composition, and olfactory and innate behavior. In parallel, mice lacking bacterial TMA production or host TMA oxidation have altered circadian rhythms.

      Strengths:

      These authors use state-of-the-art bacterial and murine genetics to dissect the roles of TMA, TMAO, and their receptor in various metabolic outcomes (primarily measuring plasma and tissue cytokine/gene expression). They also follow a unique and unexpected behavioral/olfactory phenotype. Statistics are impeccable.

    3. Reviewer #2 (Public review):

      Summary:

      In the manuscript by Mahen et al., entitled "Gut Microbe-Derived Trimethylamine Shapes Circadian Rhythms Through the Host Receptor TAAR5," the authors investigate the interplay between a host G protein-coupled receptor (TAAR5), the gut microbiota-derived metabolite trimethylamine (TMA), and the host circadian system. Using a combination of genetically engineered mouse and bacterial models, the study demonstrates a link between microbial signaling and circadian regulation, particularly through effects observed in the olfactory system. Overall, this manuscript presents a novel and valuable contribution to our understanding of host-microbe interactions and circadian biology. The addition of new data following revision adds mechanistic depth to more fully support the authors' conclusions.

      Strengths:

      (1) The manuscript addresses an important and timely topic in host-microbe communication and circadian biology.

      (2) The studies employ multiple complementary models, e.g., Taar5 knockout mice, microbial mutants, which enhances the depth of the investigation.

      (3) The integration of behavioral, hormonal, microbial, and transcript-level data provides a multifaceted view of the observed phenotype.

      (4) Inclusion of rhythmic analysis of a defined microbial community adds novelty and strength to the overall findings.

      (5) The identification of olfactory-linked circadian changes in the context of gut microbes adds a novel perspective to the field.

      Weaknesses:

      (1) While the authors suggest a causal role for TAAR5 and its ligand in circadian regulation, some of the data remain correlative in this context; however, the authors have appropriately tempered these claims, and mechanistic experiments are proposed to expand upon their compelling findings in future work.

    4. Reviewer #3 (Public review):

      Summary:

      Deletion of the TMA-sensor TAAR5 results in circadian alterations in the gene expression, particularly in the olfactory bulb; plasma hormones; and neurobehaviors.

      Strengths:

      Genetic background was rigorously controlled.

      Comprehensive characterization.

      Impact:

      These data add to the growing literature pointing to a role for the TMA/TMAO pathway in olfaction and neurobehavior.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      This study focuses on the bacterial metabolite TMA, generated from dietary choline. These authors and others have previously generated foundational knowledge about the TMA metabolite TMAO, and its role in metabolic disease. This study extends those findings to test whether TMAO's precursor, TMA, and its receptor TAAR5 are also involved and necessary for some of these metabolic phenotypes. They find that mice lacking the host TMA receptor (Taar5-/-) have altered circadian rhythms in gene expression, metabolic hormones, gut microbiome composition, and olfactory and innate behavior. In parallel, mice lacking bacterial TMA production or host TMA oxidation have altered circadian rhythms.

      Strengths:

      These authors use state-of-the-art bacterial and murine genetics to dissect the roles of TMA, TMAO, and their receptor in various metabolic outcomes (primarily measuring plasma and tissue cytokine/gene expression). They also follow a unique and unexpected behavioral/olfactory phenotype. Statistics are impeccable.

      Weaknesses:

      Enthusiasm for the manuscript is dampened by some ambiguous writing and the presentation of ideas in the introduction, both of which could easily be improved upon revision.

      We apologize for the abbreviated and ambiguous writing style in our original submission. Given Reviewer 2 also suggested reorganizing and rewriting certain parts, we have spent time to remove ambiguity by adding additional points of clarification and adding more historical context to justify studying TMA-TAAR5 signaling in regulating host circadian rhythms. We have also reorganized the presentation of data aligned with this.

      Reviewer #2 (Public review):

      Summary:

      In the manuscript by Mahen et al., entitled "Gut Microbe-Derived Trimethylamine Shapes Circadian Rhythms Through the Host Receptor TAAR5," the authors investigate the interplay between a host G protein-coupled receptor (TAAR5), the gut microbiota-derived metabolite trimethylamine (TMA), and the host circadian system. Using a combination of genetically engineered mouse and bacterial models, the study demonstrates a link between microbial signaling and circadian regulation, particularly through effects observed in the olfactory system. Overall, this manuscript presents a novel and valuable contribution to our understanding of host-microbe interactions and circadian biology. However, several sections would benefit from improved clarity, organization, and mechanistic depth to fully support the authors' conclusions.

      Strengths:

      (1) The manuscript addresses an important and timely topic in host-microbe communication and circadian biology.

      (2) The studies employ multiple complementary models, e.g., Taar5 knockout mice, microbial mutants, which enhance the depth of the investigation.

      (3) The integration of behavioral, hormonal, microbial, and transcript-level data provides a multifaceted view of the observed phenotype.

      (4) The identification of olfactory-linked circadian changes in the context of gut microbes adds a novel perspective to the field.

      Weaknesses:

      While the manuscript presents compelling data, several weaknesses limit the clarity and strength of the conclusions.

      (1) The presentation of hormonal, cytokine, behavioral, and microbiome data would benefit from clearer organization, more detailed descriptions, and functional grouping to aid interpretation.

      We appreciate this comment and have reorganized the data to improve functional grouping and readability. We have also added additional detail to descriptions of the data in the revised figure legends and results.

      (2) Some transitions, particularly from behavioral to microbiome data, are abrupt and would benefit from better contextual framing.

      We agree with this comment, and have added additional language to provide smoother transitions. This in many cases brings in historical context of why we focused on both behavioral and microbiome alterations in this body of work.

      (3) The microbial rhythmicity analyses lack detail on methods and visualization, and the sequencing metadata (e.g., sample type, sex, method) are not clearly stated.

      We apologize for this, and have now added more detail in our methods, figures, and figure legends to ensure the reader can easily understand sample type, sex, and the methods used. 

      (4) Several figures are difficult to interpret due to dense layouts or vague legends, and key metabolites and gene expression comparisons are either underexplained or not consistently assessed across models.

      In line with the previous comment, we have now added more detail to our methods, figures, and figure legends to provide clear information. We have also provided additional data showing the same key metabolites, hormones, and gene expression alterations in each model wherever the same endpoints were measured.

      (5) Finally, while the authors suggest a causal role for TAAR5 and its ligand in circadian regulation, the current data remain correlative; mechanistic experiments or stronger disclaimers are needed to support these claims.

      We agree with this comment, and as a result have removed any language causally linking TMA and TAAR5 in circadian regulation. Instead, we only state the findings in each model and refrain from overinterpreting.

      Reviewer #3 (Public review):

      Summary:

      Deletion of the TMA-sensor TAAR5 results in circadian alterations in gene expression, particularly in the olfactory bulb, plasma hormones, and neurobehaviors.

      Strengths:

      Genetic background was rigorously controlled.

      Comprehensive characterization.

      Weaknesses:

      The weaknesses identified by this reviewer are minor.

      Overall, the studies are very nicely done. However, despite careful experimentation, I note that even the controls vary considerably in their gene expression, etc., across time (e.g., compare control graphs for Cry1 in 1B, 4B). It makes me wonder how inherently noisy these measurements are. While I think that the overall point that the Taar5 KO shows circadian changes is robust, future studies to dissect which changes are reproducible over the noise would be helpful.

      We thank the reviewer for this insightful comment. We completely agree that there are clear differences between the circadian data from experiments in Taar5<sup>-/-</sup> mice and those from gnotobiotic mice in which we genetically deleted CutC. Although the data from Taar5<sup>-/-</sup> mice show robust circadian rhythms, the data from mice in which microbial CutC is altered have inherently more “noise”. We attribute some of this to the fact that the mice in the Taar5<sup>-/-</sup> experiment have a fully intact and diverse gut microbiome, whereas the gnotobiotic study with CutC manipulation includes only a 6-member microbial community that does not represent the normal microbiome diversity in the gut. This defined synthetic community was used as a rigorous reductionist approach, but likely affected the normal interactions between a complex intact gut microbiome and host circadian rhythms. We have added some additional discussion to indicate this in the limitations section of the manuscript.

      Impact:

      These data add to the growing literature pointing to a role for the TMA/TMAO pathway in olfaction and neurobehavior.

      Reviewer #1 (Recommendations for the authors):

      I suggest a revision of the writing and organization. The potential impact of the study after reading the introduction is unclear. One example, in the intro: "TMAO levels are associated with many human diseases including diverse forms of CVD<sup>5-12</sup>, obesity<sup>13,14</sup>, type 2 diabetes<sup>15,16</sup>, chronic kidney disease (CKD)<sup>17,18</sup>, neurodegenerative conditions including Parkinson's and Alzheimer's disease<sup>19,20</sup>, and several cancers<sup>21,22</sup>." It would be helpful to explain how the previous literature has distinguished that the driver of these phenotypes is TMA/TMAO and not increased choline intake. Basically, for a TMA/O novice reader, a more detailed intro would be helpful.

      We appreciate this insightful comment and have now provided a more expansive historical context for the reader regarding the effects of choline consumption (which impacts many things, including choline, acetylcholine, phosphatidylcholine, TMA, TMAO, etc) versus the primary effects of TMA and TMAO.

      There were also many uses of vague language (regulation/impact/etc). Directionality would be super helpful.

      We thank the reviewer for this recommendation and have improved the language as suggested to show the directionality of our findings. The terms regulation, impact, shape, etc. are used only when we describe multiple variables changing at the same time over the course of a 24-hour circadian period (some increased and some decreased).

      Reviewer #2 (Recommendations for the authors):

      In the manuscript by Mahen et al., entitled "Gut Microbe-Derived Trimethylamine Shapes Circadian Rhythms Through the Host Receptor TAAR5," the authors investigate the interplay between a host G protein-coupled receptor (TAAR5), the gut microbiota-derived metabolite trimethylamine (TMA), and the host circadian system. Using a combination of genetically engineered mouse and bacterial models, the study demonstrates a link between microbial signaling and circadian regulation, particularly through effects observed in the olfactory system. Overall, this manuscript presents a novel and valuable contribution to our understanding of host-microbe interactions and circadian biology. However, several sections would benefit from improved clarity, organization, and mechanistic depth to fully support the authors' conclusions. Below are specific major and minor suggestions intended to enhance the presentation and interpretation of the data.

      Major suggestions:

      (1) Consider adding a schematic/model figure as Panel A early in the manuscript to help readers understand the experimental conditions and major comparisons being made.

      We thank the reviewer for this recommendation and have added a graphical abstract figure to help the reader understand the major comparisons being made. 

      (2) Could the authors present body weight and food intake characteristics in Taar5 KO vs. WT animals?

      We have added body weight data as requested in Figure 1, Figure supplement 1. Although we have not stressed these mice with a high fat diet for these behavioral studies, under chow-fed conditions studied here we did not find any significant differences in body weight. Given no difference in body weight, we did not collect data on food consumption and have mentioned this as a limitation in the discussion.  

      (3) Several figures, especially Figures 3 and 4, and Supplemental Figures, would benefit from more structured organization and expanded legends. Grouping related data into thematic panels (e.g., satiety vs. appetite hormones, behavioral domains) may help improve readability.

      We appreciate the reviewer’s thoughtful comments and agree that reorganization would improve clarity. We have reorganized figures to improve clarity and have expanded the figure legends to provide more detail on experimental methods. 

      (4) Clarify and expand the description of hormonal and cytokine changes. For instance, the phrase "altered rhythmic levels" is vague - do the authors mean dampened, phase-shifted, enhanced, etc., relative to WT controls?

      Given a similar suggestion was made by Reviewer 1, we have provided more precise language focused on directionality and which specific endpoints we are referring to. For anything looking at circadian rhythms, the revised manuscript includes specific indications when we are discussing mesor, amplitude, and acrophase alterations. The terms regulation, impact, shape etc. are used only when we describe multiple complex variables changing at the same time over the time course of a 24-hour circadian period (some increased and some decreased).
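
      For readers unfamiliar with the terms mesor, amplitude, and acrophase used in this response, the following sketch shows how they are defined in a single-component cosinor model, y(t) = M + A·cos(2π(t − t_peak)/T). The response does not state which rhythm-analysis method the authors used, so the model choice, the function name, and the synthetic data are all assumptions for illustration. The closed-form estimates below are valid only for evenly spaced samples that span whole periods.

```python
import math


def cosinor_fit(times_h, values, period_h=24.0):
    """Single-component cosinor estimates for evenly spaced samples spanning whole periods.

    Returns (mesor, amplitude, acrophase_h):
      mesor       - rhythm-adjusted mean level
      amplitude   - half the peak-to-trough distance of the fitted cosine
      acrophase_h - time of the fitted peak, in hours within one period
    """
    n = len(values)
    w = 2.0 * math.pi / period_h
    mesor = sum(values) / n
    # Projections onto cosine and sine; orthogonal for evenly spaced full-period sampling
    beta = 2.0 / n * sum(y * math.cos(w * t) for t, y in zip(times_h, values))
    gamma = 2.0 / n * sum(y * math.sin(w * t) for t, y in zip(times_h, values))
    amplitude = math.hypot(beta, gamma)
    acrophase_h = (math.atan2(gamma, beta) / w) % period_h
    return mesor, amplitude, acrophase_h


# Synthetic, noise-free example (assumed values): sampling every 4 h over 24 h
times = [0, 4, 8, 12, 16, 20]
values = [10.0 + 3.0 * math.cos(2.0 * math.pi * (t - 8.0) / 24.0) for t in times]
mesor, amplitude, acrophase = cosinor_fit(times, values)
print(f"mesor={mesor:.2f}, amplitude={amplitude:.2f}, acrophase={acrophase:.2f} h")
```

      In this vocabulary, a "dampened" rhythm corresponds to a lower amplitude, a "phase-shifted" rhythm to a changed acrophase, and an overall increase or decrease in level to a changed mesor.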

      (5) Consider grouping hormones and cytokines functionally (e.g., satiety vs. appetite-stimulating, pro- vs. anti-inflammatory) to better interpret how these changes relate to the KO phenotype.

      We thank the reviewer for this recommendation, and have re-organized figure panels to reflect this.

      (6) Please provide a more detailed description of the behavioral results, particularly those in Supplemental Figure 2.

      We have expanded the methods description in the revised figure legends and have also added a more detailed description of the behavioral results.

      (7) As with hormonal data, behavioral outcomes would be easier to follow if organized thematically (e.g., locomotor activity, anxiety-like behavior, circadian-related behavior), especially for readers less familiar with behavioral assays.

      We appreciate this reviewer’s comment and agree that we can better group our data to show how each test is associated with the type of behavior it assesses. As a result, we have reorganized the behavioral data into broad categories such as olfactory-related, innate, cognitive, depressive/anxiety-like, or social behaviors. We have also added new data in each of these behavioral categories to provide a more comprehensive understanding of behavioral alterations seen in Taar5<sup>-/-</sup> mice.

      (8) The following statement needs clarification: "Also, it is important to note that many behavioral phenotypes examined, including tests not shown, were unaltered in Taar5-/- mice (Figures S2G, S2H, and S2I)." Consider rephrasing to explicitly state the intended message: are the authors emphasizing a lack of behavioral phenotype, or highlighting specific unaltered aspects?

      We apologize for this confusing statement, and have changed the verbiage to improve readability. To expand the comprehensive nature of this study, we also now include the tests that were “not shown” in the original submission to provide a more comprehensive understanding of behavioral alterations seen in Taar5<sup>-/-</sup> mice. These new data are included as 6 different figure supplements to main Figure 2.

      (9) The transition from behavior to microbiome data feels abrupt. Can the authors better explain whether the behavioral changes are thought to result from gut microbial function, independent of TMA-Taar5 signaling?

      We apologize for the poor transitions in our writing style. We have taken care to explain the previous findings linking the TMA pathway to circadian reorganization of the gut microbiome (mostly coming from our original paper, Schugar R, et al. 2022, eLife) and how this correlates with behavioral phenotypes. At this point it is difficult to know whether the microbiome changes are driving the behavioral changes or, conversely, whether central TAAR5 signaling is altering oscillations in the gut microbiome; we present our findings here as a framework for follow-up studies to address these questions more precisely. It is important to note that our experiment using defined-community gnotobiotic mice with or without the capacity to produce TMA (i.e., CutC-null community) clearly shows that microbial TMA production can impact host circadian rhythms in the olfactory bulb. Additional experiments beyond the scope of this work will be required to test which phenotypes originate from TMA-TAAR5 signaling versus broader effects of the restructured gut microbiome.

      (10) For Figure 3A, please expand the microbiome results with more granularity:

      (a) Indicate in the Results section whether the sequencing method was 16S amplicon or metagenomic.

      Sequencing was done using 16S rRNA amplicon sequencing using methods published by our group (PMID: 36417437, PMID: 35448550).

      (b) State whether samples were from males, females, or a mix. 

      We have indicated that all mice from Figure 1 were male mice in the revised figure legend.

      (c) Clarify whether beta diversity is based on phylogenetic or non-phylogenetic metrics. Consider using both  types if not already done.

      Beta diversity was analyzed using the Bray-Curtis dissimilarity index as the metric. Details have been included in the methods section.
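
      For readers unfamiliar with the metric named above, the following is a minimal sketch of the Bray-Curtis dissimilarity between two samples. It uses only per-taxon abundance counts and no phylogenetic tree, which is why it counts as a non-phylogenetic metric. The function name and the example counts are hypothetical and are not taken from the manuscript or its analysis code.

```python
def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity between two abundance vectors.

    Both inputs list counts for the same taxa in the same order.
    Returns 0.0 for identical samples and 1.0 for samples sharing no taxa.
    """
    if len(sample_a) != len(sample_b):
        raise ValueError("samples must cover the same taxa")
    total = sum(sample_a) + sum(sample_b)
    if total == 0:
        raise ValueError("at least one sample must contain counts")
    # Sum of absolute per-taxon differences, scaled by the total abundance.
    return sum(abs(a - b) for a, b in zip(sample_a, sample_b)) / total


# Hypothetical counts for three taxa in two samples.
print(bray_curtis([10, 0, 5], [5, 5, 5]))
```

      Because the index is computed from raw or relative abundances alone, two communities made up of closely related but distinct taxa score as dissimilar as two made up of distantly related taxa; a phylogenetic metric would be needed to capture that difference.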

      (d) Make lines partially transparent in the Beta-diversity plot so that individual points are visible.

      We have now updated the Beta-diversity plot with individual points visualized.

      (e) Clarify what percentage of variation in the Beta-diversity plot is explained by CCA1, and whether this low percentage suggests minimal community-level differences.

      We have updated the Beta-diversity plot to include the R<sup>2</sup> and p-values associated with these data.

      (f) Confirm if the y-axis on the Beta-diversity plot should be labeled CCA2 rather than "CCAA 1".

      We appreciate this comment, as it identified a typographical error in the plot. The revised figure now includes the proper label of CCA2 instead of CCAA 1.

      (11) For Figure 3B:

      (a) Provide a description of the taxonomy plot in the results.

      We have added a description of the taxonomy plot in the revised results section.

      (b) Add phylum-level labels and enlarge the legend to improve the readability of genus-level data.

      We agree this is a good suggestion so have enlarged the legend for the genus-level data and have also added phylum-level plots as well in the revised manuscript in Figure 3, figure supplement 1.

      (12) Rhythmicity of the microbiome is central to the manuscript. The current approach of comparing relative abundance at discrete time points is limiting.

      We thank the reviewer for this comment. We agree that discrete time points are not enough to describe circadian rhythmicity. In addition to comparing genotypes at discrete time points, we also used a rigorous cosinor analysis to plot the data over a 24-hour time period, and those differences are shown in the figure itself as well as Table 1.

      (a) Please describe how rhythmicity was determined, e.g., what data or statistical method supports the statement: "Taar5-/- mice showed loss of the normal rhythmicity for Dubosiella and Odoribacter genera yet gained in amplitude of rhythmicity for Bacteroides genera (Figure 3 and S3)."

      We appreciate this reviewer comment. Rhythmicity was determined by cosinor analysis implemented in R. Cosinor analysis is a statistical method used to model and analyze rhythmic patterns in time-series data, typically assuming a sinusoidal (cosine) shape. It estimates key parameters such as mesor (mean level), amplitude (height of oscillation), and acrophase (timing of the peak), making it especially useful in fields like chronobiology and circadian rhythm research. We have used this method in previous research to describe circadian rhythms. We do plan to improve the language concerning the directionality of these circadian changes.
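
      The cosinor parameters described above can be illustrated with a short sketch. This is not the authors' R program; it is a minimal single-component fit in Python, assuming a fixed 24-hour period and ordinary least squares on the linearized model y = M + b*cos(wt) + g*sin(wt). The function names and the synthetic data are hypothetical.

```python
import math


def solve_linear(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for i in reversed(range(n)):
        partial = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - partial) / aug[i][i]
    return solution


def fit_cosinor(t_hours, values, period=24.0):
    """Return (mesor, amplitude, acrophase in hours) for a single-component cosinor."""
    omega = 2.0 * math.pi / period
    rows = [[1.0, math.cos(omega * t), math.sin(omega * t)] for t in t_hours]
    # Normal equations: (X^T X) coef = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * v for r, v in zip(rows, values)) for i in range(3)]
    mesor, beta, gamma = solve_linear(xtx, xty)
    amplitude = math.hypot(beta, gamma)
    # Time of the fitted peak, expressed in hours within one period.
    acrophase = (math.atan2(gamma, beta) / omega) % period
    return mesor, amplitude, acrophase


# Synthetic example: mean level 10, amplitude 3, peak 8 hours into the cycle.
times = [0, 4, 8, 12, 16, 20]
signal = [10 + 3 * math.cos(2 * math.pi * (t - 8) / 24) for t in times]
print(fit_cosinor(times, signal))
```

      In these terms, a dampened rhythm is a lower amplitude, a phase shift is a change in acrophase, and an overall increase or decrease is a change in mesor. A full analysis would also test whether the amplitude differs from zero, which this sketch omits.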

      (b) Supplemental Figure S3 needs reorganization to highlight key findings. It's not currently clear how taxa are arranged or what trends are being shown.

      The data in Figure S3 show the entire 24-hour time course of the cecal taxa that were significantly altered for at least one time point between Taar5<sup>+/+</sup> and Taar5<sup>-/-</sup> mice. Given we showed time point-specific alterations in the Main Figure 3, we thought these more expansive plots would be important to show to depict how the circadian rhythms were altered.

      (c) Supplemental Table 1, which includes 16S features, should be referenced and discussed in the microbiome section.

      We have now referenced and discussed Supplemental Table 1 which includes all cosinor statistics for microbiome and other data presented in circadian time point studies.

      (13) Did the authors quantify the 16S rRNA gene via RT-PCR to determine if this was similar between KO and WT over the 24-hour period?

      We did not quantify 16S rRNA gene via RT-PCR, but do not think adding this will change our overall interpretations.

      (14) Reorganize Figure 4 to align with the order of results discussed, starting with TMA and TMAO, followed by related metabolites like choline, L-carnitine, and gamma-butyrobetaine.

      We thank the reviewer for this comment. We have chosen this organization because it is ordered from substrates (choline, L-carnitine, and betaine) to the microbe-associated products (TMA then TMAO). We will improve the writing associated with this figure to clearly explain this organization.

      (a) Although the changes in the latter metabolites are more modest, they may still have physiological relevance. Could the authors comment on their significance?

      We appreciate this reviewer comment and agree. We have expanded the results and discussion to address this.

      (15) The authors note similarities in circadian gene expression between Taar5 KO mice and Clostridium sporogenes WT vs. ΔcutC mice, but the gene patterns are not consistent.

      (a) Can the authors clarify what conclusions can reasonably be drawn from this comparison?

      We hesitate to make definitive conclusions in the manuscript on why the gene patterns are not consistent, because it would be speculation. However, one major factor likely driving differences is the diversity of the gut microbiome in the different studies. For instance, in the studies using Taar5<sup>+/+</sup> and Taar5<sup>-/-</sup> mice there is a very diverse microbiome in these conventionally housed mice. In contrast, by design the experiment using Clostridium sporogenes WT vs. ΔcutC communities is a reductionist approach that allows us to genetically define TMA production. In these gnotobiotic mice, the simplified community has very limited diversity and this likely alters the host circadian rhythms in gene expression quite dramatically. Although it is impossible to directly compare the results between these experiments given the difference in microbiome diversity, there are clearly alterations in host gene expression when we manipulate TMA production (i.e. ΔcutC community) or TMA sensing (i.e. Taar5<sup>-/-</sup>).

      (16) Were circadian and metabolic genes (e.g., Arntl, Cry1, Per2, Pemt, Pdk4) also analyzed in brown adipose tissue of Taar5 KO mice, and how do these results compare to the Clostridium models?

      We thank the reviewer for this comment. Unfortunately, we did not collect brown adipose tissue in our original Taar5 study. We plan to do this in future follow-up studies of cold-induced thermogenesis that are beyond the scope of this manuscript. However, we have decided to include data from our two-timepoint Taar5 study, which looks at ZT2 (9am) and ZT14 (9pm). There are clear differences in circadian genes between these timepoints.

      (17) To allow a more direct comparison, please ensure the same cytokines (e.g., IL-1β, IL-2, TNF-α, IFN-γ, IL-6, IL-33) are reported for both the Taar5 KO and microbial models.

      We thank the reviewer for this comment and now include data from the same cytokines for each study.

      (18) What was the defined microbial community used to colonize germ-free mice with C. sporogenes strains? Did this community exhibit oscillatory behavior?

      To define TMA levels using a genetically tractable model of a defined microbial community, we leveraged access to the community originally described by our collaborator Dr. Federico Rey (University of Wisconsin – Madison) (PMID: 25784704). We chose this community because it provides some functional metabolic diversity and is well known to allow for sufficient versus deficient TMA production. We are thankful for the reviewer's comments about oscillatory behavior of this defined community and, to be responsive, have performed sequencing to detect the species over time. These data are now included in the revised manuscript and show that there are clear differences in the oscillatory behavior of the defined community members. These data provide additional support that bacterial TMA production alters not only host circadian rhythms but also the rhythmic behavior of the gut bacteria themselves, which has never been described before.

      (19) Can the authors explain the rationale for measuring additional metabolites such as tryptophan, indole acetic acid, phenylacetic acid, and phenylacetylglycine? How are these linked to CutC gene function or Taar5 signaling?

      We appreciate that this could be confusing, but have included other gut microbial metabolites to be as comprehensive as possible. This is important to include because we have found in other gnotobiotic studies where we have genetically altered metabolite production that, if we alter one gut microbe-derived metabolite, there can be unexpected alterations in other distinct classes of microbe-derived metabolites (PMID: 37352836). This is likely due to the fact that complex microbe-microbe and microbe-host interactions work together to define systemic levels of circulating metabolites, influencing both the production and turnover of distinct and unrelated metabolites.

      (20) The authors make several strong claims suggesting that loss of Taar5 or disruption of its ligand directly alters the circadian gene network. However, the current data are correlative. The authors should clarify that these findings demonstrate associations rather than direct causal effects, unless additional mechanistic evidence is provided. Approaches such as studies conducted in constant darkness, measurements of wheel-running behavior, or analyses that control for potential confounding factors, e.g., inflammation or metabolic disruption, would help establish whether the observed changes in clock gene expression are primary or secondary effects. The authors are encouraged to either soften these causal claims or acknowledge this limitation explicitly in the discussion.

      We thank the reviewer for this comment. We agree and have softened our language about direct effects of TMA via TAAR5 because we agree the data presented here are correlative only. 

      Minor suggestions:

      (1) Avoid repetitive phrases such as "it is important to note..." for improved flow. Rephrasing these instances will enhance readability.

      We thank the reviewer for this suggestion and have deleted such repetitive phrases.  

      (2) For Figure 2, remove interpretations above the graphs and use simple, descriptive panel labels, similar to those in Supplemental Figure 2.

      We have removed these interpretations as suggested, but have retained descriptive panel labels to help the reader understand what type of data are being presented.

      Reviewer #3 (Recommendations for the authors):

      Minor:

      In Figure 1D, UCP1 does not appear to be significantly changed.

      We thank the reviewer for this comment and agree that UCP1 gene expression is not significantly altered. However, given the key role that UCP1 plays in white adipose tissue beiging, which is suppressed by the TMAO pathway, we think it is critical to show that this effect appears unaffected by perturbed TMA-TAAR5 signaling.

      It would be helpful, in the discussion, to summarize any consistent changes across Taar5 KO, CutC deletion, and FMO3 deletion.

      We have added this to the discussion, but as discussed above we hesitate to make strong interpretations about consistency between the models because the microbiome diversity is so different between the studies, and we did not measure all endpoints in both models.

      For the Cosinor analysis, it may be helpful to remove the p-values that are >0.05 from the figures.

      We have now removed any non-significant p-values that are associated with our figures. 

      For Figure 2, Supplement 1E, what are the two bars for each genotype?

      We appreciate the reviewer pointing this out and will further explain this test in the figure with labels and in the legend.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Editor's comments:

      I would encourage you to submit a revised version that addresses the following two points:

      [a] The point from Reviewer #1 about a possible major confounding factor. The following article might be germane here: Baas and Fennell, 2019: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3339568

      I don’t believe that the point raised by reviewer 1 is a confounder, see my response below.

      The highlighted article was on my reading list, but I did not cite it because I was confused by its methods.

      [b] The point from Reviewer #4 about the abstract. It is important that the abstract says something about how reviewers reacted to the original versions of articles in which they were cited (i.e., the odds ratio = 0.84, etc. result), before going on to discuss how they reacted to revised articles (i.e., the odds ratio = 1.61, etc. result). I would suggest doing this along the following lines, but please feel free to reword the passage "but this effect was not strong/conclusive":

      When reviewers were cited in the original version of the article under review, they were less likely to approve the article compared with reviewers who were not cited, but this effect was not strong/conclusive (odds ratio = 0.84; adjusted 99.4% CI: 0.69-1.03). However, when reviewers were cited in the revised version of the article, they were more likely to approve compared with reviewers who were not cited (odds ratio = 1.61; adjusted 99.4% CI: 1.16-2.23).

      I have changed the abstract to include the odds ratios for version 1 and have used the same wording as from the main text.
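
      The difference between the two results in the suggested wording can be checked directly from the reported intervals. The sketch below assumes that each interval was computed symmetrically on the log-odds scale, as is usual for logistic-type models; the source does not state this, so it is an assumption. The function names are hypothetical. An interval that contains 1 is compatible with no association, which is why the version 1 effect is described as not conclusive.

```python
import math


def geometric_midpoint(lower, upper):
    """Centre of an interval that is symmetric on the log scale."""
    return math.sqrt(lower * upper)


def excludes_null(lower, upper, null_value=1.0):
    """True when the interval does not contain the null odds ratio."""
    return not (lower <= null_value <= upper)


# Intervals quoted in the editor's suggested abstract wording.
reported = [("version 1", 0.69, 1.03), ("revised versions", 1.16, 2.23)]
for label, lower, upper in reported:
    print(label, round(geometric_midpoint(lower, upper), 2), excludes_null(lower, upper))
# version 1 0.84 False
# revised versions 1.61 True
```

      The recovered midpoints match the reported odds ratios of 0.84 and 1.61, which is consistent with the log-scale assumption.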

      Reviewer #1 (Public review):

      Summary:

      The work used open peer reviews and followed them through a succession of reviews and author revisions. It assessed whether a reviewer had requested the author include additional citations and references to the reviewer's work. It then assessed whether the author had followed these suggestions and what the probability of acceptance was based on the author's decision. Reviewers who were cited were more likely to recommend the article for publication when compared with reviewers who were not cited. Reviewers who requested and received a citation were much more likely to accept than reviewers who requested and did not receive a citation.

      Strengths and weaknesses:

      The work's strengths are the in-depth and thorough statistical analysis it contains and the very large dataset it uses. The methods are robust and reported in detail.

      I am still concerned that there is a major confounding factor: if you ignore the reviewer's requests for citations, are you more likely to have ignored all their other suggestions too? This has now been mentioned briefly and slightly circuitously in the limitations section. I would still like this (I think) major limitation to be given more consideration and discussion, although I am happy that it cannot be addressed directly in the analysis.

      This is likely to happen, but I do not think it’s a confounder. A confounder needs to be associated with both the outcome and the exposure of interest. If we consider forthright authors who are more likely to rebuff all suggestions, then they would receive just as many citation and self-citation requests as authors who were more compliant. The behaviour of forthright authors would likely only reduce the association seen in most authors which would be reflected in the odds ratios.

      Reviewer #2 (Public review):

      Summary:

      This article examines reviewer coercion in the form of requesting citations to the reviewer's own work as a possible trade for acceptance and shows that, under certain conditions, this happens.

      Strengths:

      The methods are well done and the results support the conclusions that some reviewers "request" self-citations and may be making acceptance decisions based on whether an author fulfills that request.

      Weakness:

      I thank the author for addressing my comments about the original version.

      Reviewer #3 (Public review):

      Summary:

      In this article, Barnett examines a pressing question regarding citing behavior of authors during the peer review process. In particular, the author studies the interaction between reviewers and authors, focusing on the odds of acceptance, and how this may be affected by whether or not the authors cited the reviewers' prior work, whether the reviewer requested such citations be added, and whether the authors complied/how that affected the reviewer decision-making.

      Strengths:

      The author uses a clever analytical design, examining four journals that use the same open peer review system, in which the identities of the authors and reviewers are both available and linkable to structured data. Categorical information about the approval is also available as structured data. This design allows a large scale investigation of this question.

      Weaknesses:

      My original concerns have been largely addressed. Much more detail is provided about the number of documents under consideration for each analysis, which clarifies a great deal.

      Much of the observed reviewer behavior disappears or has much lower effect sizes depending on whether "Accept with Reservations" is considered an Accept or a Reject. This is acknowledged in the results text. Language has been toned down in the revised version.

      The conditional analysis on the 441 reviews (lines 224-228) does support the revised interpretation as presented.

      No additional concerns are noted.

      Reviewer #4 (Public review):

      Summary:

      This work investigates whether a citation to a referee made by a paper is associated with a more positive evaluation by that referee for that paper. It provides evidence supporting this hypothesis. The work also investigates the role of self-citations by referees where the referee would ask authors to cite the referee's paper.

      Strengths:

      This is an important problem: referees for scientific papers must provide their impartial opinions rooted in core scientific principles. Any undue influence due to the role of citations breaks this requirement. This work studies the possible presence and extent of this.

      The methods are solid and well done. The work uses a matched pair design which controls for article-level confounding and further investigates robustness to other potential confounds.

      Weaknesses:

      The authors have addressed most concerns in the initial review. The only remaining concern is the asymmetric reporting and highlighting of version 1 (null result) versus version 2 (rejecting null). For example the abstract says "We find that reviewers who were cited in the article under review were more likely to recommend approval, but only after the first version (odds ratio = 1.61; adjusted 99.4% CI: 1.16 to 2.23)" instead of a symmetric sentence "We find ... in version 1 and ... in version 2".

      The latest version now includes the results for both versions.

    2. eLife Assessment

      This important study explored a number of issues related to citations in the peer review process. An analysis of more than 37000 peer reviews at four journals found that: i) during the first round of review, reviewers were less likely to recommend acceptance if the article under review cited the reviewer's own articles; ii) during the second and subsequent rounds of review, reviewers were more likely to recommend acceptance if the article cited the reviewer's own articles; iii) during all rounds of review, reviewers who asked authors to cite the reviewer's own articles (a practice known as 'coercive citation') were less likely to recommend acceptance. However, when an author agreed to cite work by the reviewer, the reviewer was more likely to recommend acceptance of the revised article. The evidence to support these claims is convincing.

    3. Joint Public Review:

      From Reviewer 3 previously: Barnett examines a pressing question regarding citing behavior of authors during the peer review process. In particular, the author studies the interaction between reviewers and authors, focusing on the odds of acceptance, and how this may be affected by whether or not the authors cited the reviewers' prior work, whether the reviewer requested such citations be added, and whether the authors complied/how that affected the reviewer decision-making.

      Key findings are a) that reviewers were more likely to approve an article if cited in the submission, b) reviewers who requested a citation in an updated version were less likely to approve, and c) reviewers who requested and received a citation were more likely to approve the revised version.

      Comment from the Reviewing Editor about the latest version:

      This is the third version of this article. Comments made during the peer review of the second version, along with author's responses to these comments, are available below.

      Comments made during the peer review of the first version, along with author's responses to these comments, are available with previous versions of the article.

    1. eLife Assessment

      This important study uses innovative microfluidics-based single-cell imaging to monitor replicative lifespan, protein localization, and intracellular iron levels in aging yeast cells. The evidence for the proposed role of Ssd1 and reduced nutrients for lifespan through limiting iron uptake is convincing, even though some mechanistic details remain unclear. This work will be of interest to cell biologists working on aging and iron metabolism.

    2. Reviewer #1 (Public review):

      Summary:

      Overexpression of the mRNA binding protein Ssd1 was shown before to expand the replicative lifespan of yeast cells, whereas ssd1 deletion had the opposite effect. Here, the authors provide initial evidence that overproduced Ssd1 might act via sequestration of mRNAs of the Aft1/2-dependent iron regulon. Ssd1 overexpression restricts activation of the iron regulon and limits accumulation of Fe2+ inside cells, thereby likely lowering oxidative damage. The effects of Ssd1 overexpression and calorie restriction on lifespan are epistatic, suggesting that they might act through the same pathway.

      Strengths:

      The study is well-designed and involves analysis of single yeast cells during replicative aging. The findings are well displayed and largely support the derived model, which also has implications on lifespan of other organisms including humans.

      Weaknesses:

      The model is largely supported by the findings; however, they remain correlative. Whether the knockout of ssd1 shortens lifespan by increased intracellular Fe2+ levels is unknown and the shortened lifespan might be caused by different Ssd1 functions. The finding that increased Ssd1 levels form condensates in a cell-cycle-dependent manner is interesting, yet the role of the condensates in lifespan expansion remains untested and unlinked.

      Comments on revisions:

      In their revised version and response letter the authors have largely addressed my previous concerns. I would have liked to see an experimental response to some of the points of criticism, but I accept that they have been addressed purely in writing. There are some aspects that should be further elaborated by the authors. I agree that determining the mRNAs that co-sequester with Ssd1 foci will be part of an independent study, yet whether Ssd1 foci are relevant for lifespan expansion remains unclear and I would have hoped for some more detailed consideration of this point in the discussion section. Similarly, it should be clearly stated that the impact of Ssd1 overexpression is unlinked from the cellular function of Ssd1 produced at authentic levels and that the short-lived phenotype of an ssd1 knockout is likely not caused by overactivation of the iron regulon (based on the authors' reply). I would appreciate it if the authors included these aspects more clearly in the discussion.

    3. Reviewer #2 (Public review):

      This manuscript describes the use of a powerful technique called microfluidics to elucidate the mechanisms explaining how overexpression (OE) of Ssd1 and caloric restriction (CR) in yeast extend replicative lifespan (RLS). Microfluidics measures RLS by trapping cells in chambers mounted to a slide. The chambers hold the mother cell but allow daughters to escape. The slide, with many chambers, is recorded during the entire process, roughly 72 hours, with the video monitored afterwards to count how many daughters each of the trapped mothers produces. The power of the method is what can be done with it. For example, the entire process can be viewed by fluorescence so that GFP and mCherry-tagged proteins can be followed as cells age. The budding yeast is the only model where bona fide replicative aging can be measured, and microfluidics is the only system that allows protein localization and levels to be measured in a single cell while aging. The authors do a wonderful job of showing what this combination of tools can do.

      The authors had previously shown that Ssd1, an mRNA-binding protein, extends RLS when overexpressed. This was attributed to Ssd1 sequestering away specific mRNAs under stress, likely leading to reduced ribosomal function. It remained completely unknown how Ssd1 OE extended RLS. The authors observed that overexpressed, but not normally expressed, Ssd1 formed cytoplasmic condensates during mitosis that are resolved by cytokinesis. When the condensates fail to be resolved at the end of mitosis, this signals death.

      It has become clear in the literature that iron accumulation increases with age within the cell. The transcriptional programs that activate the iron regulon also become elevated in aging cells. This is thought to be due to impaired mitochondrial function in aging cells, with increased iron accumulation as an attempt at restoring mitochondrial activity. The authors show that Ssd1 OE and CR both reduce the expression of the iron regulon. The data presented indicate that iron accumulation shortens RLS: deletion of iron regulon components extends RLS, and adding iron to WT cells decreases RLS, but not when Ssd1 is overexpressed or when cells are calorically restricted. Interestingly, iron chelation using BPS has no impact on WT RLS, but decreases the elevated RLS in CR cells and cells overexpressing Ssd1. It was not initially clear why iron chelation would inhibit the extended lifespan seen with CR and Ssd1 OE. This was addressed by an experiment where it was shown that the iron regulon is induced (FIT2 induction) when iron is chelated. Thus, the detrimental effects of induction of the iron regulon by BPS and iron accumulation on RLS cannot be tempered by Ssd1 OE and CR once turned on.

      Comments on Revised Version:

      I am content with the authors' responses to my prior comments.

    4. Reviewer #3 (Public review):

      In this paper, the authors investigate how the RNA-binding protein Ssd1 and calorie restriction (CR) influence yeast replicative lifespan, with a particular focus on age-dependent iron uptake and activation of the iron regulon. For this, they use microfluidics-based single-cell imaging to monitor replicative lifespan, protein localization, and intracellular iron levels across aging cells. They show that both Ssd1 overexpression and CR act through a shared pathway to prevent the nuclear translocation of the iron-regulon regulator Aft1 and the subsequent induction of high-affinity iron transporters. As a result, these interventions block the age-related accumulation of intracellular free iron, which otherwise shortens lifespan. Genetic and chemical epistasis experiments further demonstrate that suppression of iron regulon activation is the key mechanism by which Ssd1 and CR promote replicative longevity.

      Overall, the paper is technically rigorous, and the main conclusions are supported by a substantial body of experimental data. The microfluidics-based assays in particular provide compelling single-cell evidence for the dynamics of Ssd1 condensates and iron homeostasis.

      My main concern, however, is that the central reasoning of the paper, that Ssd1 overexpression and CR prevent the activation of the iron regulon, appears to be contradicted by previous findings, and the authors may actually be misrepresenting these studies, unless I am mistaken. In the manuscript, the authors state on two occasions:

      "Intriguingly, transcripts that had altered abundance in CR vs control media and in SSD1 vs ssd1∆ yeast included the FIT1, FIT2, FIT3, and ARN1 genes of the iron regulon (8)"

      "Ssd1 and CR both reduce the levels of mRNAs of genes within the iron regulon: FIT1, FIT2, FIT3 and ARN1 (8)"

      However, reference (8) by Kaeberlein et al. actually says the opposite:

      "Using RNA derived from three independent experiments, a total of 97 genes were observed to undergo a change in expression >1.5-fold in SSD1-V cells relative to ssd1-d cells (supplemental Table 1 at http://www.genetics.org/supplemental/). Of these 97 genes, only 6 underwent similar transcriptional changes in calorically restricted cells (Table 2). This is only slightly greater than the number of genes expected to overlap between the SSD1-V and CR datasets by chance and is in contrast to the highly significant overlap in transcriptional changes observed between CR and HAP4 overexpression (Lin et al. 2002) or between CR and high external osmolarity (Kaeberlein et al. 2002). Intriguingly, of the 6 genes that show similar transcriptional changes in calorically restricted cells and SSD1-V cells, 4 are involved in iron-siderochrome transport: FIT1, FIT2, FIT3, and ARN1 (supplemental Table 1 at http://www.genetics.org/supplemental/)."

      Although the phrasing might be ambiguous at first reading, this interpretation is confirmed upon reviewing Matt Kaeberlein's PhD thesis: https://dspace.mit.edu/handle/1721.1/8318

      (page 264 and so on)

      Moreover, consistent with this, activation of the iron regulon during calorie restriction (or the diauxic shift) has also been observed in two other articles:

      https://doi.org/10.1016/S1016-8478(23)13999-9

      https://doi.org/10.1074/jbc.M307447200

      Taken together, these contradictory data might blur the proposed model and make it unclear how to reconcile the results.

      Comments on revisions:

      The authors successfully addressed my requests and concerns.

    5. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #2 (Public review):

      (1) Why would BPS not reduce RLS in WT cells? The authors could test whether OE of FIT2 reduces RLS in WT cells.  

      Our data indicate that the iron regulon gets turned on naturally in old cells, presumably due to reduced iron sensing, limiting their lifespan. Although we haven’t tested it experimentally, BPS would presumably also turn on the iron regulon in wild-type cells, and its effect would therefore be redundant with the activation of the iron regulon that occurs naturally during normal aging. It may be interesting in the future to see if higher levels of BPS can shorten the lifespan of wild-type cells. Similarly, we would predict that overexpression of FIT2 may reduce the lifespan, as its deletion has been shown to extend RLS.

      (2) The authors should add a brief explanation for why the GDP1 promoter was chosen for Ssd1 OE.

      We used the same promoter that was used to overexpress Ssd1 in all previous studies. This is now stated in the text along with the relevant citations. 

      (3) On page 12, growth to saturation was described as glucose starvation. This is more accurately described as nutrient deprivation. Referring to it as glucose starvation is akin to CR, which growing to saturation is not. Ssd1 OE formed condensates upon saturation but not in CR. Why do the authors think Ssd1 OE did not form condensates upon CR?

      Too mild a stress?

      This is a fair comment, and we have now changed glucose starvation to nutrient deprivation, as it is more accurate. The effects of nutrient starvation are profound: the cell cycle stops, autophagy is induced, cells undergo the diauxic shift, metabolism changes. None of these changes occur during calorie restriction (0.05% glucose) such that it is not too surprising that Ssd1 does not form condensates during CR. We speculate that the stress is just too mild.   

      (4) The authors conclude that the main mechanism for RLS extension in CR and Ssd1 OE is the inhibition of the iron regulon in aging cells. The data certainly supports this. However, this may be an overstatement as other mutations block CR, such as mutations that impair respiration. The authors do note that induction of the iron regulon in aging cells could be a response to impaired mitochondrial function. Thus, it seems that the main goal of CR and Ssd1 OE may be to restore mitochondrial function in aging cells, one way being inactivation of the iron regulon. A discussion of how other mutations impact CR would be of benefit.

      While some labs have shown that respiration impacts CR, this is not the case in other studies. For example, an impactful paper by Kaeberlein et al., PLOS Genetics 2005 showed that CR does extend lifespan in respiratory deficient strains using many different strain backgrounds.

      (5) The cell cycle regulation of Ssd1 OE condensates is very interesting. There does not appear to be literature linking Ssd1 with proteasome-dependent protein turnover. Many proteins involved in cell cycle regulation and genome stability are regulated through ubiquitination. It is not necessary to do anything here about it, but it would be interesting to address how Ssd1 condensates may be regulated with such precision.

      We see no evidence of changes in Ssd1 protein intensity during the cell cycle. We therefore speculate that the difference arises at the post-translational level rather than through Ssd1 degradation: a cell cycle-regulated phosphatase and kinase are known to regulate Ssd1 phosphorylation and condensation state, and the timing of their function matches when the Ssd1 condensates appear and dissolve in the cell cycle. We have now discussed this and allude to it in the model.

      (6) While reading the draft, I kept asking myself what the relevance to human biology was. I was very impressed with the extensive literature review at the end of the discussion, going over how well conserved this strategy is in yeast with humans. I suggest referring to this earlier, perhaps even in the abstract. This would nail down how relevant this model is for understanding human longevity regulation.

      Thank you, we have now mentioned in the abstract the relevance to human work. 

      In conclusion, I enjoyed reading this manuscript, describing how Ssd1 OE and CR lead to RLS increases, using different mechanisms. However, since the 2 strategies appear to be using redundant mechanisms, I was surprised that synergism was not observed.

      We thank the reviewer for their kind comment. We propose that Ssd1 overexpression impacts the levels of the iron regulon transcripts, which would be downstream of the point in the pathway that is affected by CR, i.e., nuclear localization of Aft1. The lack of synergy fits with this model, as Ssd1 overexpression cannot impact the iron regulon transcripts if they are not induced due to CR. We have now improved the model to make the impact of these different anti-aging interventions on activation of the iron regulon more clear.

      Reviewer #3 (Public review):

      My main concern is that the central reasoning of the paper, that Ssd1 overexpression and CR prevent the activation of the iron regulon, appears to be contradicted by previous findings, and the authors may actually be misrepresenting these studies, unless I am mistaken. In the manuscript, the authors state on two occasions:

      "Intriguingly, transcripts that had altered abundance in CR vs control media and in SSD1 vs ssd1∆ yeast included the FIT1, FIT2, FIT3, and ARN1 genes of the iron regulon (8)"

      "Ssd1 and CR both reduce the levels of mRNAs of genes within the iron regulon: FIT1, FIT2, FIT3 and ARN1 (8)"

      However, reference (8) by Kaeberlein et al. actually says the opposite:

      "Using RNA derived from three independent experiments, a total of 97 genes were observed to undergo a change in expression >1.5-fold in SSD1-V cells relative to ssd1-d cells (supplemental Table 1 at http://www.genetics.org/supplemental/). Of these 97 genes, only 6 underwent similar transcriptional changes in calorically restricted cells (Table 2). This is only slightly greater than the number of genes expected to overlap between the SSD1-V and CR datasets by chance and is in contrast to the highly significant overlap in transcriptional changes observed between CR and HAP4 overexpression (Lin et al. 2002) or between CR and high external osmolarity (Kaeberlein et al. 2002). Intriguingly, of the 6 genes that show similar transcriptional changes in calorically restricted cells and SSD1-V cells, 4 are involved in iron-siderochrome transport: FIT1, FIT2, FIT3, and ARN1 (supplemental Table 1 at http://www.genetics.org/supplemental/)."

      Although the phrasing might be ambiguous at first reading, this interpretation is confirmed upon reviewing Matt Kaeberlein's PhD thesis: https://dspace.mit.edu/handle/1721.1/8318 (page 264 and so on).

      Moreover, consistent with this, activation of the iron regulon during calorie restriction (or the diauxic shift) has also been observed in two other articles:

      https://doi.org/10.1016/S1016-8478(23)13999-9

      https://doi.org/10.1074/jbc.M307447200

      Taken together, these contradictory data might blur the proposed model and make it unclear how to reconcile the results.

      We thank the reviewer for pointing this out. Upon further consideration, we have now removed all mention of this paper from our manuscript, as it is not relevant to our situation: the mRNA abundance studies during CR, or with and without Ssd1, were not performed under conditions in which the iron regulon is activated, such as aging, so there would have been no opportunity to detect reduced transcript levels due to CR or the presence of Ssd1. Also, none of these studies were performed with Ssd1 overexpression, which is the situation we are examining. Our data clearly show that Ssd1 overexpression and CR reduced and prevented, respectively, production of proteins from the iron regulon during aging.

      We do not feel that the iron regulon being activated by nutrient depletion at the diauxic shift is a fair comparison to the situation in cells happily dividing during CR. The levels of nutrient deprivation used in those studies have profound effects, including arresting cell growth, activating autophagy, and altering metabolism. The levels of CR that we use (0.05% glucose) do not activate any of these changes, nor the iron regulon, in young cells or old cells (Fig. 4).

      Reviewer #1 (Recommendations for the authors):

      (1) The role of Ssd1 condensate formation in mRNA sequestration and lifespan expansion remains unclear. Thus, the study involves two parts (Ssd1 condensate formation and lifespan expansion via limiting Fe2+ accumulation), which are poorly linked. The study will therefore benefit from further data linking the two aspects.

      Future experiments are planned to determine what mRNAs reside in the age-induced Ssd1 overexpression condensates, to determine if they include the iron regulon transcripts. This will require us to optimize isolation of old cells and isolation of the Ssd1 condensates from them, and is beyond the scope of the present study.

      (2) The beneficial effects of Ssd1 overexpression and calorie restriction (CR) on lifespan are epistatic, yet the claim that both experimental conditions act via the same pathway should be further documented. It is recommended to combine Ssd1 overexpression with a well-defined condition that expands lifespan through a mechanism not involving changes in Fe2+ levels. A further increase in lifespan upon combining such conditions would at least indirectly support the authors' claim.

      We have more than epistatic evidence to indicate that Ssd1 overexpression and CR are in the same pathway. Ssd1 overexpression and CR result in failure to properly induce the iron regulon during aging and subsequent reduced levels of iron, resulting in lifespan extension, supporting that they act via the same pathway. We do appreciate the point though and epistasis analyses are on our list for future studies.

      (3) It is highly recommended to analyze ssd1 knockout cells: Is the shortened lifespan caused by intracellular Fe2+ accumulation, as predicted by the model? Does the knockout lead to an overactivation of the iron regulon? Such analysis will also document the physiological relevance of authentic Ssd1 levels in controlling yeast lifespan. The authors could test this possibility by determining intracellular Fe2+ levels (as done in Figure 5) and testing whether the mutant cells are partially rescued by the presence of an iron chelator (as done in Figure 5C).

      We don’t think the normal role of Ssd1 is to sequester the iron regulon mRNAs to prevent its activation, given that wild type yeast with endogenous Ssd1 activates the iron regulon during aging. Rather, the failure to activate the iron regulon during aging is unique to Ssd1 overexpression and does not occur at endogenous Ssd1 levels. As such, it may not be the case that the short lifespan of ssd1 yeast is due to iron accumulation (if that happens); yeast lacking SSD1 also have cell wall biogenesis problems, and the defects in cell wall biogenesis shorten the replicative lifespan (Molon et al., Biogerontology 2018, PMID 29189912).

      (4) Figure 4: The authors could not analyze the impact of Ssd1 overexpression on the localization of GFP-Aft1 due to synthetic sickness. This was not observed under calorie restriction (CR) conditions and is therefore unexpected. Why should Ssd1 overexpression and CR have such diverse impacts on cellular physiology when combined with GFP-Aft1? Isn't that observation arguing against CR and increased Ssd1 levels acting through the same pathway? A further clarification of this point is necessary.

      Without further experimentation, we can only speculate that cellular changes that are unique to overexpression of Ssd1 and not shared with CR cause a negative interaction with GFP-Aft1. Of note, Aft1 has functions in addition to its role in activating the iron regulon (aft1∆ strains have a growth defect independent of its role in iron regulon activation [27]), and we have shown previously that Ssd1 overexpression reduces global protein translation. Future experiments would be necessary to delineate the basis for this synthetic sickness.

      (5) Lowering Fe2+ levels upon Ssd1 overexpression is predicted to reduce oxidative stress. It is suggested to determine ROS levels upon Ssd1 overexpression to bolster that point.

      This is a great suggestion. The lowering of Fe2+ in the Ssd1 mutants is something that happens at the end of the lifespan and therefore we would need to do experiments to detect reduced ROS using a live dye on our microfluidics platform. We are not aware of any live fluorescent reporters of ROS.  

      Reviewer #2 (Recommendations for the authors):

      (1) Page 6, 7th line of Replicative lifespan analyses, there is a double bracket.

      This has been corrected. Thank you.

      (2) Page 18, line 6 of "failure to activate..." section, "revered" should be replaced with "reversed".

      This has been corrected. Thank you.

      (3) Page 23, fix writing on line 2 of "Effects of CR..." section.

      This has been corrected. Thank you.

      (4) Page 24, Author contributions section, replace "performed devised" with "designed".

      This has been corrected. Thank you.

      Reviewer #3 (Recommendations for the authors):

      (1) Figure 3C: The panel legend is somewhat confusing due to the color scheme and the scattering of labels across panels. A more consistent labeling strategy would help readability.

      We agree, and the labelling has now been improved. Thank you. 

      (2) Figure 3D vs Figure 3B: it appears that Fit2 activation occurs substantially earlier than Aft1 translocation, which reduces the predictive value of Fit2 compared to Aft1. This is puzzling given that Fit2 is expected to be a direct target of Aft1. Could this discrepancy be related to the thresholding used for Fit2-mCherry display? The color scale in Figure 3D is also somewhat misleading, as most of the segments appear greenish. A continuous color gradient, perhaps restricted to the [10-120] interval, might give a clearer picture of iron regulon activation.

      For the Aft1-mCherry experiment, we are only able to accurately annotate nuclear localization when Aft1 has been fully (or mostly) translocated into the nucleus from the cytoplasm, so these data are likely to be on the conservative side. However, activation of the iron regulon likely occurs as Aft1 is being translocated into the nucleus, so a minimal initial amount of Aft1 (which we do not have enough resolution to detect in this system) could be enough for FIT2 and ARN1 induction. By contrast, the Fit2 and Arn1 signal measures an increase over essentially no background, so it is very easy to detect even at low-level induction. To allow readers to see all our data without over-thresholding, we prefer to present the induction of Fit2 and Arn1 at all intensity levels, including the very low-level induction (green).

      (3) "In control strains, expression of Fit2 and Arn1 varied across the population, but generally increased with age": for the right panel, normalization might be more appropriate. What is the fold change in fluorescence during lifespan? Reporting ΔmCherry intensity alone does not provide a quantitative measure of induction.

      We have changed the figure to show quantitation as fold change, as suggested.

      (4) Figure 6 (model): The model figure is conceptually useful but not easy to follow in its current form; a revised schematic with a clearer depiction of the pathway activations at different replicative ages would be helpful.

      We have changed the figure to make the model more clear, as suggested.

    1. eLife Assessment

      This valuable study investigates how perceptual and semantic features of maternal behavior adapt to infants' attention during naturalistic play, providing new insights into the bidirectional and hierarchical organization of early social interaction. The methodology is innovative and overall solid, supported by comprehensive multimodal analyses and advanced information-theoretic methods, though some developmental claims warrant further tests of directionality and age effects. The work will be of interest to psychologists, cognitive scientists, and developmental researchers studying early communication, social learning, and methodological innovation in quantifying naturalistic behavior.

    2. Reviewer #1 (Public review):

      Summary:

      This paper investigates infants' social perception as reflected in looking behavior during face-to-face mother-infant toy play in two groups (5 and 15 months). Using information-theoretic and computer-vision methods, the authors quantify dynamic changes in lower-level (salience) and higher-level (semantic) features in the auditory and visual domains - primarily from mothers - and relate these to infants' real-time attention to toys (and to mothers). Time-lagged correlations suggest dynamic, reciprocal relations between infants' attention and maternal low-level (salience) and high-level (semantic) features at both ages, consistent with an early emergence of interpersonal social contingency based on multi-level information during interaction.

      Strengths:

      The study uses a naturalistic, multimodal mother-infant free-play paradigm and applies information-theoretic/AI methods to quantify both low- and high-level features of maternal behavior, enabling a fine-grained decomposition of interaction dynamics. The time-lag approach further allows examination of temporal relations between maternal signals and infants' attention.

      Weaknesses:

      Directionality claims from cross-correlations are sometimes unclear, especially when both positive and negative lags are significant, and the evidence for age effects is not yet convincing. Infant attention was manually coded with only moderate-substantial agreement, and handling of disagreements/uncodable periods should be clarified and acknowledged as a limitation.
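
      The directionality concern can be made concrete with a minimal sketch of a time-lagged correlation, written here in plain Python. This is not the authors' pipeline: the function names, the lag convention (infant[t] paired with mother[t + lag]), and the toy series are all assumptions chosen for illustration. In the toy data the infant series is simply the maternal series delayed by three samples, so under this convention the peak should fall at lag -3, i.e. the maternal signal leads.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def lagged_correlation(infant, mother, max_lag):
    """Correlate infant[t] with mother[t + lag] for every lag in [-max_lag, max_lag].

    Assumed convention: a peak at a negative lag means the maternal signal
    came first; a peak at a positive lag means the infant signal came first.
    """
    n = len(infant)
    profile = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = infant[: n - lag], mother[lag:]
        else:
            a, b = infant[-lag:], mother[: n + lag]
        profile[lag] = pearson(a, b)
    return profile


# Hypothetical toy data: the infant series copies the maternal series 3 samples later.
mother = [(7 * i) % 11 for i in range(40)]
infant = [0, 0, 0] + mother[:-3]

profile = lagged_correlation(infant, mother, max_lag=5)
best_lag = max(profile, key=profile.get)
print(best_lag, round(profile[best_lag], 3))
```

      When real data show significant correlations at both negative and positive lags, a single peak of this kind is not available, which is why the direction of influence becomes hard to read off the lag profile alone.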

    3. Reviewer #2 (Public review):

      Summary:

      This study examines the dynamic interplay between infant attention and hierarchical maternal behaviors from a social information processing perspective. By employing a comprehensive naturalistic framework, the authors quantified interactions across both low-level (sensory) and high-level (semantic) features. Using correlation analyses of these features, they found that within social contexts, behaviors such as joint attention, which are shaped by mutual interaction, exhibit patterns distinct from unilateral responding or mimicry. In contrast to traditional semi-structured behavioral observation and coding, the methods employed in this study were designed to consciously and sensitively capture these dynamic features and relate them temporally. This approach contributes to a more integrated understanding of the developmental principles underlying capacities like joint action and communication.

      Strengths:

      The manuscript's core strength lies in its innovative, dynamic, and hierarchical framework for investigating early social attention. The findings reveal complex adaptive scaffolding strategies: for instance, when infants focus on objects, mothers reduce low-level sensory input, minimising distractions. Furthermore, the results indicate that, even from early development, maternal behaviors are both driven by and predictive of infant attention, confirming that attention involves complex interactive processes that unfold across multiple levels, from salience to semantics.

      From a methodological standpoint, the use of unstructured play situations, combined with multi-channel, high-precision time-series analyses, undoubtedly required substantial effort in both data collection and coding. Compared to the relatively two-dimensional analytical approaches common in prior research, this study's introduction of lower-level and higher-level features to explore the hierarchical organization of processing across development is highly plausible. The psychological processes reflected by these quantified physical features span multiple domains - including emotion, motion, and phonetics - and the high temporal sampling rate ensures fine-grained resolution.

      Critically, these features are extracted through a suite of advanced machine learning and computational methods, which automate the extraction of objective metrics from audiovisual data. Consequently, the methodological flow significantly enhances data utilization and offers valuable inspiration for future behavioral coding research aiming for high ecological validity.

      Weaknesses:

      The conclusion of this paper is generally supported by the data and analysis, but some aspects of data analysis need to be clarified and extended.

      (1) A more explicit justification for the selection and theoretical categorization of the eight interaction features may be needed. The paper introduces a distinction between "lower-level" and "higher-level" features but does not clearly articulate the criteria underpinning this classification. While a continuum is acknowledged, the practical division requires a principled rationale. For instance, is the classification based on the temporal scale of the features, the degree of cognitive processing required for their integration, or their proximity to sensory input versus semantic meaning?

      (2) The claims regarding age-related differences in Prediction 2 are not fully substantiated by the current analyses. The findings primarily rely on observing that an effect is significant in one age group but not the other (e.g., the association between object naming and attention is significant at 15 months but not at 5 months). However, this pattern alone does not constitute evidence about whether the two age groups differ significantly from each other. The absence of a direct statistical comparison (e.g., an interaction test in a model that includes age as a factor) creates an inferential gap. To robustly support developmental change, formal tests of the Age × Feature interaction on infant attention are required.

      (3) Another methodological issue concerns the potential confounding effect of parents' use of the infant's name. The analysis of "object naming" does not clarify whether utterances containing object words (e.g., "panda") were distinct from those that also incorporated the infant's name (e.g., "Look, Sarah, the panda!"). Given that a child's own name is a highly salient social cue known to robustly capture infant attention, its co-occurrence with object labels could potentially inflate or confound the measured effect of object naming itself. It would be important to know whether and how frequently infants' names were called, whether this variable was analyzed separately, and if its effect was statistically disentangled from that of pure object labeling.

      (4) Interpretation of results requires clarification regarding the extended temporal lags reported, specifically the negative correlation between maternal vocal spectral flux and infant attention at 6.54 to 9.52 seconds (Figure 4C). The authors interpret this as a forward-prediction, suggesting that a decrease in acoustic variability leads to increased infant attention several seconds later. However, a lag of such duration seems unusually long for a direct, contingent infant response to a specific vocal feature. Is there existing empirical evidence from infant research to support such a prolonged response latency? Alternatively, could this signal suggest a slower, cyclical pattern of the interaction rather than a direct causal link?
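
      The inferential gap raised in point (2) can be illustrated with a minimal sketch of a direct between-group comparison. The reviewer asks for a formal Age × Feature interaction test in a model; the label-permutation test below is only a simple stand-in for that idea, not the reviewer's or the authors' method. The function name, the use of one effect value per dyad, and the numbers in the usage lines are hypothetical.

```python
import random


def permutation_test_mean_diff(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    Each value is assumed to be one dyad's effect estimate (for example,
    a per-dyad correlation between a maternal feature and infant attention).
    Group labels are shuffled to build the null distribution.
    """
    rng = random.Random(seed)
    n_a = len(group_a)

    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_perm + 1)


# Hypothetical per-dyad effects for two age groups (illustration only).
five_months = [0.02, -0.05, 0.10, 0.04, -0.01, 0.07, 0.03, 0.00]
fifteen_months = [0.12, 0.08, 0.15, 0.05, 0.11, 0.09, 0.14, 0.06]
print(permutation_test_mean_diff(five_months, fifteen_months))
```

      The point of the comparison is that the test statistic is the difference between groups itself, rather than two separate within-group significance decisions.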

    4. Reviewer #3 (Public review):

      Summary:

      This manuscript presents an ambitious integration of multiple artificial intelligence technologies to examine social learning in naturalistic mother-infant interactions. The authors aimed to quantify how information flows between mothers and infants across different communicative modalities and timescales, using speech analysis (Whisper), pose detection (MMPose), facial expression recognition, and semantic modeling (GPT-2) in a unified analytical framework. Their goal was to provide unprecedented quantitative precision in measuring behavioral coordination and information transfer patterns during social learning, moving beyond traditional observational coding approaches to examine cross-modal coordination patterns and semantic contingencies in real-time across multiple temporal scales.

      Strengths:

      The integration of multiple AI tools into a coherent analytical framework represents a genuine methodological breakthrough that advances our capabilities for studying complex social phenomena. The authors successfully analyzed naturalistic interactions at a scale and level of detail that was not previously possible, examining 33 5-month-old and 34 15-month-old dyads across multiple modalities simultaneously. This sophisticated analytical pipeline, combining speech analysis, semantic modeling, pose detection, and facial expression recognition, provides new capabilities for studying social interactions that extend far beyond what traditional observational coding could achieve.

      The specific findings about hierarchical information flow patterns across different timescales are particularly valuable and would not have been possible without this sophisticated analytical approach. The discovery that mothers reduce low-level sensory input when infants focus on objects, while increases in object naming and information rate associate with sustained attention, provides new empirical insights into how social learning unfolds in naturalistic settings. The temporal dynamics analyses reveal interesting patterns of behavioral coordination that extend our understanding of how caregivers adaptively modify their responses to support infant attention across multiple communicative channels simultaneously.

      The scale of data collection and the comprehensive multi-modal approach are impressive, opening up new possibilities for understanding social learning processes. The methodological innovations demonstrate how modern computational tools can be systematically integrated to reveal new quantitative aspects of well-established developmental phenomena. The computational features developed for this study represent innovative applications of information theory and computer vision to developmental research.

      Weaknesses:

      Several major limitations affect the reliability and interpretability of the findings. The sample sizes of 33-34 dyads per age group are relatively modest for the complexity of analyses performed, which include eight different features examined across various time lags with extensive statistical comparisons. The study lacks adequate power analysis to demonstrate whether these sample sizes are sufficient to detect meaningful effect sizes, which is particularly concerning given the multiple comparison burden inherent in this type of multi-modal, multi-timescale analysis.

      The statistical framework presents several concerns that limit confidence in the findings. Inter-rater reliability for gaze coding shows substantial but not excellent agreement (κ = 0.628), with only 22% of the data undergoing double coding. Given that gaze coding forms the foundation for all subsequent analyses of joint attention and information flow, this reliability level may systematically influence findings. The multiple comparison correction strategies vary inconsistently across different analyses, with some using FDR correction and others treating lower-level and higher-level features separately. Additionally, object naming analyses employed one-sided tests (p<0.05) while others used two-sided tests (p<0.025) without clear theoretical or methodological justification for these differences.
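
      For readers less familiar with the FDR correction mentioned above, the following is a minimal sketch of the standard Benjamini-Hochberg step-up procedure in plain Python. It is not taken from the manuscript; the function name and the example p-values are assumptions for illustration. Applying one such procedure across a single, pre-declared family of tests is one way to avoid the inconsistency described.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level alpha.

    Step-up rule: sort the m p-values, find the largest rank k with
    p_(k) <= (k / m) * alpha, and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from five tests treated as one family.
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6]))
```

      In this toy family only the two smallest p-values survive, even though four of the five fall below an uncorrected 0.05 threshold.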

      The validation of AI tools in the specific context of mother-infant interactions is insufficient and represents a critical limitation. The performance characteristics of Whisper with infant-directed speech, the precision of MMPose for detecting facial landmarks in young children, and the accuracy of facial expression recognition tools in infant contexts are not adequately validated for this population. These sophisticated tools may not perform optimally in the specific context of mother-infant interactions, where speech patterns, facial expressions, and body movements may differ substantially from their training data.

      The theoretical positioning requires substantial refinement to better acknowledge the extensive existing literature. The authors are working within a well-established theoretical framework that has long recognized social learning as an active, bidirectional process. The joint attention literature, beginning with foundational work by Bruner (1983) and continuing through contemporary theories of social cognition by researchers like Tomasello (1995), has emphasized the communicative and adaptive nature of attentional processes. The scaffolding literature, including seminal work by Wood, Bruner, and Ross (1976), has demonstrated how parents adjust their support based on children's developing competencies. Moreover, there is a substantial body of micro-analytic research that has employed sophisticated quantitative methods to study social interactions, including work by Stern (1985) on microsecond-level interactions and research using time-series methods to examine dyadic coordination patterns.

      The cross-correlation analyses have inherent limitations for causal inference that are not adequately acknowledged. The interpretation of temporal correlation patterns in terms of directional influence requires more cautious consideration, as observational data have fundamental constraints for establishing causality. The ecological validity is also questionable due to the laboratory tabletop interaction paradigm and the sample's demographic homogeneity, consisting primarily of white, highly educated, high-income mothers.

    1. eLife Assessment

      This valuable study focuses on a unique morphogenetic module, the junction-based lamellipodia (JBL). It provides a biomechanical understanding of how JBLs control endothelial cell-cell junctional remodelling to generate lumenised, multicellular blood vessels. The manuscript represents a robust, thoughtfully executed, and convincing study that uses high-resolution time-lapse imaging combined with pharmacological treatments to advance our understanding of lumen formation in vascular development.

    2. Reviewer #1 (Public review):

      Summary:

      Lumen formation is a fundamental morphogenetic event essential for the function of all tubular organs, notably the vertebrate vascular network, where continuous and patent conduits ensure blood flow and tissue perfusion. The mechanisms by which endothelial cells organize to create and maintain luminal space have historically been categorized into two broad strategies: cell shape changes, which involve alterations in apical-basal polarity and cytoskeletal architecture, and cell rearrangements, wherein intercellular junctions and positional relationships are remodeled to form uninterrupted conduits. The study presented here focuses on the latter process, highlighting a unique morphogenetic module, junction-based lamellipodia (JBL), as the driver for endothelial rearrangements.

      Strengths:

      The key mechanistic insight from this work is the requirement of the Arp2/3 complex, the classical nucleator of branched actin filament networks, for JBL protrusion. This implicates Arp2/3-mediated actin polymerization in pushing force generation, enabling plasma membrane advancement at junctional sites. The dependence on Arp2/3 positions JBL within the family of lamellipodia-like structures, but the junctional origin and function distinguish them from canonical, leading-edge lamellipodia seen in cell migration.

      Weaknesses:

      The study primarily presents descriptive observations and includes limited quantitative analyses or genetic modifications. Molecular mechanisms are typically interrogated through the use of pharmacological inhibitors rather than genetic approaches. Furthermore, the precise semantic distinction between JAIL and JBL requires additional clarification, as current evidence suggests their biological relevance may substantially overlap.

    3. Reviewer #2 (Public review):

      Summary:

      In Maggi et al., the authors investigated the mechanisms that regulate the dynamics of a specialized junctional structure called junction-based lamellipodia (JBL), which they have previously identified during multicellular vascular tube formation in the zebrafish. They identified the Arp2/3 complex to dynamically localize at expanding JBLs and showed that the chemical inhibition of Arp2/3 activity slowed junctional elongation. The authors therefore concluded that actin polymerization at JBLs pushes the distal junction forward to expand the JBL. They further revealed the accumulation of Myl9a/Myl9b (marker for MLC) at the junctional pole, at interjunctional regions, suggesting that contractile activity drives the merging of proximal and distal junctions. Indeed, chemical inhibition of ROCK activity decreased junctional mergence. With these new findings, the authors added new molecular and cellular details into the previously proposed clutch mechanism by proposing that Arp2/3-dependent actin polymerization provides pushing forces while actomyosin contractility drives the merging of proximal and distal junctions, explaining the oscillatory protrusive nature of JBLs.

      Strengths:

      The authors provide detailed analyses of endothelial cell-cell dynamics through time-lapse imaging of junctional and cytoskeletal components at subcellular resolution. The use of zebrafish as an animal model system is invaluable in identifying novel mechanisms that explain the organizing principles of how blood vessels are formed. The data is well presented, and the manuscript is easy to read.

      Weaknesses:

      While the data generally support the conclusions reached, some aspects can be strengthened. For the untrained eye, it is unclear where the proximal and distal junctions are in some images, and so it is difficult to follow their dynamics (especially in experiments where Cdh5 is used as the junctional marker). Images would benefit from clear annotation of the two junctions. All perturbation experiments were done using chemical inhibitors; this can be further supported by genetic perturbations.

    4. Reviewer #3 (Public review):

      The paper by Maggi et al. builds on earlier work by the team (Paatero et al., 2018) on oriented junction-based lamellipodia (JBL). They validate the role of JBLs in guiding endothelial cell rearrangements and utilise high-resolution time-lapse imaging of novel transgenic strains to visualise the formation of distal junctions and their subsequent fusion with proximal junctions. Through functional analyses of Arp2/3 and actomyosin contractility, the study identifies JBLs as localized mechanical hubs, where protrusive forces drive distal junction formation, and actomyosin contractility brings together the distal and proximal junctions. This forward movement provides a unique directionality which would contribute to proper lumen formation, EC orientation, and vessel stability during these early stages of vessel development.

      Time-lapse live imaging of VEC, ZO-1, and actin reveals that VEC and ZO-1 are initially deposited at the distal junction, while actin primarily localizes to the region between the proximal and distal sites. Using a photoconvertible Cdh5-mClav2 transgenic line, the origin of the VEC aggregates was examined. This convincingly shows that VE-cadherin was derived from pools outside the proximal junctions. However, in addition to de novo VEC derived from within the photoconverted cell, could some VEC also be contributed by the neighbouring endothelial cell to which the JBL is connected?

      As seen for JAILs in cultured ECs, live imaging of Arpc1b-Venus in conjunction with ZO-1 and actin reveals that Arp2/3 is enhanced when JBLs form. Therefore, Arp2/3 likely contributes to the initial formation of the distal junction in the lamellipodium.

      Inhibiting Arp2/3 with CK666 prevents JBL formation, and filopodia form instead of lamellipodia. This loss of JBLs leads to impaired EC rearrangements.

      Is the effect of CK666 treatment reversible? Since only a short (30 min) treatment is used, the overall effect on the embryo would be minimal, and thus washing out CK666 might lead to JBL formation and normalized rearrangements, which would further support the role of Arp2/3.

      From the images in Figure 4d it appears that ZO-1 levels are increased in the ring after CK666 treatment. Has this been investigated, and could this overall stabilization of adhesion proteins further prevent elongation of the ring?

      To explore how the distal and proximal junctions merge, spatiotemporal imaging of Myl9 and VEC is conducted. It indicates that Myl9 is localized at the interjunctional fusion site prior to fusion. This suggests pulling forces are at play to merge the junctions, and indeed Y-27632 treatment reduces or blocks the merging of these junctions.

      For this experiment, a truncated version of VEC was used, which lacks the cytoplasmic domain. Why have the authors chosen to image this line, since lacking the cytoplasmic domain could also impair the efficiency of tension on VEC at both junction sites? This is as described in the discussion (lines 328-332).

      Since the time-lapse movies involve high-speed imaging of rather small structures, it is understandable that these are difficult to interpret. Adding labels to indicate certain structures or proteins at essential timepoints in the movies would help the readers understand these.

    1. eLife Assessment

      The authors of this manuscript study the transcriptional regulators that allow macrophages to assume different functional phenotypes in response to immune stimuli. They generate a computational map of the gene regulatory networks involved in determining macrophage phenotypes and experimentally validate the role of putative regulatory factors in a myeloid cell line. This study represents a valuable approach to understanding how gene regulation impacts macrophage polarization and their conclusions are supported by solid computational and experimental evidence. The revision has clarified that the focus is the identification of the regulatory barcodes in a myeloid cell line. Future studies in primary cells and in vivo will be required to assess the roles of these regulators in a broader context.

    2. Reviewer #1 (Public Review):

      Summary:

      Ravichandran et al. investigate the regulatory panels that determine the polarization state of macrophages. They identify regulatory factors involved in M1 and M2 polarization states by using their network analysis pipeline. They demonstrate that a set of three regulatory factors (RFs), i.e., CEBPB, NFE2L2, and BCL3, can change macrophage polarization from the M1 state to the M2 state. They also show that siRNA-mediated knockdown of those three RFs in THP1-derived M0 cells, in the presence of an M1 stimulant, increases the expression of M2 markers and decreases the bactericidal effect. This study provides an elegant computational framework to explore macrophage heterogeneity upon different external stimuli and adds an interesting approach to understanding the dynamics of macrophage phenotypes after pathogen challenge.

      Strengths:

      This study identified new regulatory factors involved in M1 to M2 macrophage polarization. The authors used their own network analysis pipeline to analyze the available datasets. The authors showed 13 different clusters of macrophages that encounter different external stimuli, which is interesting and could be translationally relevant as in physiological conditions after pathogen challenge, the body shows dynamic changes in different cytokines/chemokines that could lead to different polarization states of macrophages. The authors validated their primary computational findings with in vitro assays by knocking down the three regulatory factors-NCB.

    3. Reviewer #2 (Public Review):

      Summary:

      The authors of this manuscript address an important question regarding how macrophages respond to external stimuli to create different functional phenotypes, also known as macrophage polarization. Although this has been studied extensively, the authors argue that the transcription factors that mediate the change in state in response to a specific trigger remain unknown. They create a "master" human gene regulatory network and then analyze existing gene expression data consisting of PBMC-derived macrophage response to 28 stimuli, which they sort into thirteen different states defined by perturbed gene expression networks. They then identify the top transcription factors involved in each response that have the strongest predicted association with the perturbation patterns they identify. Finally, using S. aureus infection as one example of a stimulus that macrophages respond to, they infect THP-1 cells while perturbing regulatory factors that they have identified and show that these factors have a functional effect on the macrophage response.

      Strengths:

      The computational work done to create a "master" hGRN, response networks for each of the 28 stimuli studied, and the clustering of stimuli into 13 macrophage states is useful. The data generated will be a helpful resource for researchers who want to determine the regulatory factors involved in response to a particular stimulus and could serve as a hypothesis generator for future studies.

      The streamlined system used here - macrophages in culture responding to a single stimulus - is useful for removing confounding factors and studying the elements involved in response to each stimulus.

      The use of a functional study with S. aureus infection is helpful to provide proof of principle that the authors' computational analysis generates data that is testable and valid for in vitro analysis.

      [Reviewing Editor comments on revised version: the authors have made minimal changes and we have made a modest modification to the eLife Assessment, without returning the revised version to the original reviewers.]

    4. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      Ravichandran et al. investigate the regulatory panels that determine the polarization state of macrophages. They identify regulatory factors involved in M1 and M2 polarization states by using their network analysis pipeline. They demonstrate that a set of three regulatory factors (RFs), i.e., CEBPB, NFE2L2, and BCL3, can change macrophage polarization from the M1 state to the M2 state. They also show that siRNA-mediated knockdown of those three RFs in THP1-derived M0 cells, in the presence of an M1 stimulant, increases the expression of M2 markers and decreases the bactericidal effect. This study provides an elegant computational framework to explore macrophage heterogeneity upon different external stimuli and adds an interesting approach to understanding the dynamics of macrophage phenotypes after pathogen challenge.

      Strengths:

      This study identified new regulatory factors involved in M1 to M2 macrophage polarization. The authors used their own network analysis pipeline to analyze the available datasets. The authors showed 13 different clusters of macrophages that encounter different external stimuli, which is interesting and could be translationally relevant as in physiological conditions after pathogen challenge, the body shows dynamic changes in different cytokines/chemokines that could lead to different polarization states of macrophages. The authors validated their primary computational findings with in vitro assays by knocking down the three regulatory factors-NCB.

      We thank the reviewer for reading our manuscript and for the encouraging comments.

      Weaknesses:

      One weakness of the paper is the insufficient analysis performed on all the clusters. They used macrophages treated with 28 distinct stimuli, which included a very interesting combination of pro- and anti-inflammatory cytokines/factors that can be very important in the context of in vivo pathogen challenge, but they did not characterize the full spectrum of clusters. 

      We have performed a functional enrichment analysis of all the clusters and added a section describing the results (Fig 1B). We believe this work will provide a basis for future experiments to characterize other clusters.

      We have also performed a Principal Component Analysis (PCA) using hallmark genes of inflammation and the NCB panel alone to show the relative position of all clusters with respect to each other.

      Although they mentioned that their identified regulatory panels could determine the precise polarization state, they restricted their analysis to only the two well-established macrophage polarization states, M1 and M2. Analyzing the other states beyond M1 and M2 could substantially advance the field. They mentioned the regulatory factors involved in individual clusters but did not study the potential pathway involving the target genes of these regulatory factors, which can show the importance of different macrophage polarization states. Importantly, these findings were not validated in primary cells or using in vivo models.

      We agree it would be useful to demonstrate the polarization switch in other systems as well. However, it is currently infeasible for us to perform these experiments. 

      Reviewer #2 (Public Review):

      Summary:

      The authors of this manuscript address an important question regarding how macrophages respond to external stimuli to create different functional phenotypes, also known as macrophage polarization. Although this has been studied extensively, the authors argue that the transcription factors that mediate the change in state in response to a specific trigger remain unknown. They create a "master" human gene regulatory network and then analyze existing gene expression data consisting of PBMC-derived macrophage response to 28 stimuli, which they sort into thirteen different states defined by perturbed gene expression networks. They then identify the top transcription factors involved in each response that have the strongest predicted association with the perturbation patterns they identify. Finally, using S. aureus infection as one example of a stimulus that macrophages respond to, they infect THP-1 cells while perturbing regulatory factors that they have identified and show that these factors have a functional effect on the macrophage response.

      Strengths:

      The computational work done to create a "master" hGRN, response networks for each of the 28 stimuli studied, and the clustering of stimuli into 13 macrophage states is useful. The data generated will be a helpful resource for researchers who want to determine the regulatory factors involved in response to a particular stimulus and could serve as a hypothesis generator for future studies.

      The streamlined system used here - macrophages in culture responding to a single stimulus - is useful for removing confounding factors and studying the elements involved in response to each stimulus.

      The use of a functional study with S. aureus infection is helpful to provide proof of principle that the authors' computational analysis generates data that is testable and valid for in vitro analysis.

      We thank the reviewer for reading our manuscript and for the encouraging comments.

      Weaknesses:

      Although a streamlined system is helpful for interrogating responses to a stimulus without the confounding effects of other factors, the reality is that macrophages respond to these stimuli within a niche and while interacting with other cell types. The functional analysis shown is just the first step in testing a hypothesis generated from this data and should be followed with analysis in primary human cells or in an in vivo model system if possible.

      It would be helpful for the authors to determine whether the effects they see in the THP-1 immortalized cell line are reproduced in another macrophage cell line, or ideally in PBMC-derived macrophages.

      We agree; it would be useful in the future to demonstrate the polarization switch in other systems as well. We believe the results we provide here will inform future studies on other systems.

      The paper would benefit from an expanded explanation of the network mining approach used, as well as the cluster stability analysis and the Epitracer analysis. Although these approaches may be published elsewhere, readers with a non-computational background would benefit from additional descriptions.

      We have elaborated on the network mining approach and added a schematic diagram (Fig S13) to describe the EpiTracer algorithm.

      Although the authors identify 13 different polarization states, they return to the iM0/M1/M2 paradigm for their validation and functional assays. It would be useful to comment on the broader applications of a 13-state model.

      We have included a new figure panel describing the functional enrichment analysis of all the clusters (Fig 1B) and added a section describing the results. We have also performed a Principal Component Analysis (PCA) using hallmark genes of inflammation and the NCB panel alone to show the relative position of all clusters with respect to each other. The PCA plot shows that C11 (M1) and C3 (M2) are roughly at two extreme ends, with other clusters between them, forming something resembling a punctuated continuum of states.

      The relative contributions of each "switching factor" to the phenotype remain unclear, especially as knocking out each individual factor changes different aspects of the model (Fig. S5).

      Fig S5 shows the effect on phenotype upon individual knockdown of the switching factors, from which we deduce that CEBPB has the largest contribution in determining the phenotype. However, we maintain that all three genes are necessary as a panel for M1/M2 switching. 

      Reviewer #1 (Recommendations For The Authors):

      The manuscript by Ravichandran et al. describes the networks of genes, which they named "RF", associated with M1 to M2 polarization of macrophages by using their computational pipelines. They have shown 13 clusters of human macrophage polarization states by using an available database of different combinatorial treatments with cytokines, endotoxin, or growth factors, which is interesting and could be useful in the research field. However, there are a few comments that will help readers understand the subject more precisely.

      (1,2) The authors claimed to identify key regulatory factors involved in the human macrophage polarization from M1 to M2. However, recent advances suggest that macrophage polarization cannot be restricted to M1 and M2 only, which is also supported by the authors' data that shows 13 clusters of macrophages. However, they only focused on the difference between clusters 11 and 3 considering conventional M1 and M2. It will be more interesting to analyze the other clusters and how they relate to the established and simplistic M1 and M2 paradigms.

      It will be interesting to know if they found any difference in the enriched pathways among these different clusters considering the exclusive regulatory factors and their targets.

      We appreciate the point and have addressed it as follows. In the revised manuscript, we have discussed the clusters in detail and have provided the key regulatory factor (RF) combinations and target genes that define distinct macrophage population states (please refer to Data files S2 and S3). We have also discussed the immunological processes associated with each cluster, particularly in relation to the C11 and C3 clusters. We have added a new panel in Fig 1 showing a heatmap of the enrichment of pathways relevant to inflammation in each of the clusters (Fig 1B). Indeed, there is a substantial difference in the enrichment terms between the extreme ends (M1, M2) and significant differences in some of the pathways between clusters.

      (3) The authors have shown the involvement of NCB at 72h post LPS treatment. Are these RFs involved in late-response genes, or do they act at earlier time points of LPS treatment? Understanding the RF involvement in the dynamic response of macrophages to any stimulant will be important.

      Using the data available for different time points (30 mins to 72 hours), we plotted the fold change (with respect to unstimulated cells) in the M1 and M2 clusters for each of the NCB genes and observed a clear divergence in the trend at 24 hours. These plots are provided as newly added Supplementary Figure 9 A, B, C.
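
      As a reading aid for the time-course comparison described above, the following minimal sketch shows one common way to express fold change relative to unstimulated cells and to locate the first time point at which two conditions diverge. It is not the authors' code: the function names, the pseudocount, the one-log2-unit threshold, and all expression values are hypothetical placeholders.

```python
import math


def log2_fold_change(stimulated, unstimulated, pseudocount=1.0):
    """log2 ratio of stimulated to unstimulated expression.

    The pseudocount (an assumption of this sketch) avoids division by zero.
    """
    return math.log2((stimulated + pseudocount) / (unstimulated + pseudocount))


def first_divergence(times, cond_a, cond_b, baseline, min_gap=1.0):
    """Return the first time at which the two log2 fold changes differ by >= min_gap."""
    for t, a, b in zip(times, cond_a, cond_b):
        gap = abs(log2_fold_change(a, baseline) - log2_fold_change(b, baseline))
        if gap >= min_gap:
            return t
    return None


# Invented expression values for one gene (hours after stimulation).
times = [0.5, 1, 4, 12, 24, 72]
m1_like = [10, 12, 14, 18, 40, 80]
m2_like = [10, 11, 13, 15, 12, 8]
unstimulated = 10

divergence_time = first_divergence(times, m1_like, m2_like, unstimulated)
print(divergence_time)
```

      With a real dataset, the threshold and pseudocount would need to be chosen to suit the expression scale and replicate variability.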

      (4) The authors showed that the knockdown of RF-NCB can switch M1 to M2. However, they showed only a few conventional markers known to be M2 markers. What happens if NCB is overexpressed or knocked down in other treatment conditions/other clusters? Is RF-NCB only involved in these two specific stimulations, or can their overexpression promote M2 polarization under any given stimulus?

      It is an interesting question but for practical reasons, experimental work was limited to M1 and M2 clusters as the aim was to establish proof of concept and could not be scaled up for all clusters, which would require a large amount of work and possibly a separate study.  We believe the description of the clusters that we have provided will enable the design of future experiments that will throw light on the significance of the intermediate clusters.  

      (5) The authors have shown that knockdown of RF- NCB decreases pathogen clearance, but what are their altered functions? Are they more efficient in cellular debris clearance or resolution of inflammation? The authors can check the mRNA expression of markers/cytokines involved in those processes, in the NCB knockdown condition.

      Indeed. Expression levels were measured for the following genes: CXCL2, IL1B, iNOS, SOCS3 (which are pro-inflammatory markers), as well as MRC1, ARG1, TGFB, IL10 (anti-inflammatory markers), as shown in Fig 4B.  

      Minor comments:

      (1, 2) How do the authors evaluate the performance of their knowledge-based gene network? The authors should describe the methods in detail, including how they generated the simulated network and evaluated the simulated dataset.

      Gene network construction and module detection have many tools available. The authors need to mention which one they used. The authors should show whether their findings are consistent with at least another two module-detection methods (eg; "RedeR") to strengthen their claim.

      We have added a schematic figure (Supplementary Fig S11) and a detailed description of network construction and mining in the Methods section, as follows: We have reconstructed a comprehensive knowledge-based human Gene Regulatory Network (hGRN), which consists of Regulatory Factor (RF) to Target Gene (TG) and RF to RF interactions. To achieve this, we curated experimentally determined regulatory interactions (RF-TG, RF-RF) associated with human regulatory factors (Wingender et al., 2013). These interactions were sourced from several resources, including: (a) literature-curated resources like the Human Transcriptional Regulation Interactions database (HTRIdb) (Bovolenta et al., 2012), Regulatory Network Repository (RegNetwork) (Liu et al., 2015), Transcriptional Regulatory Relationships Unraveled by Sentence-based Text-mining (TRRUST) (Han et al., 2015), and the TRANSFAC resource from Harmonizome (Rouillard et al., 2016); (b) ChEA3, which contains ChIP-seq determined interactions (Keenan et al., 2019); and (c) high-confidence protein-protein binding interactions (RF-RF) from the human protein-protein interaction network-2 (hPPiN2) (Ravichandran et al., 2021). As a result, our hGRN comprises 27,702 nodes and 890,991 interactions. It is important to note that none of the edges/interactions in the hGRN are data-driven. We utilized this extensive hGRN, which encompasses the experimentally determined interactions/edges, to infer stimulant-specific hGRNs and top paths using our in-house network mining algorithm, ResponseNet. We have previously demonstrated that ResponseNet, which utilizes a knowledge-based network and a sensitive interrogation algorithm, outperformed data-driven network inference methods in capturing biologically relevant processes and genes; this validation has been reported earlier (Ravichandran and Chandra, 2019; Sambaturu et al., 2021).

      We utilized our in-house response network approach to identify the stimulant-specific top active and repressed perturbations (Ravichandran and Chandra, 2019; Sambaturu et al., 2021). This is clearly described in the revised manuscript. To summarize, we generated stimulant-specific Gene Regulatory Networks (GRNs) by applying weights to the master human Gene Regulatory Network (hGRN) based on differential transcriptomic responses to stimulants (i.e., comparing stimulant-treated conditions to baseline). We then produced individually weighted networks for each stimulant and implemented a refined network mining technique to extract the most significant pathways. Furthermore, we have previously conducted a systematic comparison of our network mining strategy with other data-driven module detection methods, including jActiveModules (Ideker et al, 2002), WGCNA (Langfelder et al, 2008), and ARACNE (Margolin et al, 2006). Our findings demonstrated that our approach outperformed conventional data-driven network inference methods in capturing the biologically pertinent processes and genes (Ravichandran and Chandra, 2019). Since we have experimentally validated what we predicted from the network analysis, we do not see a need for performing the computational analysis with another algorithm. Moreover, different network analyses identify functionally relevant genes or subnetworks on the basis of different criteria. While each of them outputs useful information, given the scale of the network and the number of different biologically significant subnetworks and genes that could be present in an unbiased network such as the one we have used, the outputs from different methods need not agree with each other, as they may capture different aspects altogether; such a comparison is therefore not guaranteed to be informative.
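
      For readers without a computational background, the general idea of weighting a fixed knowledge-based network with expression changes and then extracting the strongest path can be sketched as follows. This is a generic illustration only and does not reproduce the ResponseNet weighting scheme or mining algorithm, which are described in the cited papers. The edge-cost formula, the `floor` parameter, the function names, and the node names (`RF_A`, `RF_B`, `RF_C`, `TG_1`) are all assumptions made for this example.

```python
import heapq


def weight_network(edges, abs_log2fc, floor=0.01):
    """Give each directed edge a cost that is low when both endpoints change strongly.

    Assumed scheme for illustration: cost = 1 / (|log2FC(src)| * |log2FC(dst)|).
    """
    graph = {}
    for src, dst in edges:
        activity = max(abs_log2fc.get(src, 0.0), floor) * max(abs_log2fc.get(dst, 0.0), floor)
        graph.setdefault(src, []).append((dst, 1.0 / activity))
        graph.setdefault(dst, [])
    return graph


def cheapest_path(graph, source, target):
    """Dijkstra's algorithm; returns (cost, path), or (inf, []) if target is unreachable."""
    queue = [(0.0, source, [source])]
    done = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in done:
            continue
        done.add(node)
        for nxt, edge_cost in graph.get(node, []):
            if nxt not in done:
                heapq.heappush(queue, (cost + edge_cost, nxt, path + [nxt]))
    return float("inf"), []


# Hypothetical knowledge-based edges and hypothetical absolute log2 fold changes.
edges = [("RF_A", "RF_B"), ("RF_B", "TG_1"), ("RF_A", "RF_C"), ("RF_C", "TG_1")]
abs_log2fc = {"RF_A": 2.0, "RF_B": 3.0, "RF_C": 0.5, "TG_1": 2.0}

graph = weight_network(edges, abs_log2fc)
cost, path = cheapest_path(graph, "RF_A", "TG_1")
print(path, round(cost, 3))
```

      The topology is the same for every stimulus; only the costs change. This is why the same master network can yield different top paths under different stimulants.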

      (3) In Fig 2B, it is difficult for general readers to follow the authors' interpretation of 'the 3-RF combination has 1293 targets, 359 covering about 53% of the top-perturbed network'. It would be helpful if the authors could simplify the interpretation.

      This is replaced with clearer figures in the revised manuscript (Figure 2A, 2B), and the associated text is also rephrased for clarity.

      Reviewer #2 (Recommendations For The Authors):

      Major comments:

      (1) It would be helpful for the authors to determine whether the effects they see in the THP-1 immortalized cell line are reproduced in another macrophage cell line, or ideally in PBMC-derived macrophages if this is feasible. If using PBMC- or bone marrow-derived macrophages is beyond the scope of what the authors can reasonably perform, they could consider using another macrophage cell line such as RAW 264.7 cells, which would also provide orthogonal validation from a mouse model.

      At this point, it is unfortunately infeasible for us to perform these experiments because of resource limitations; they would also require considerable time. We hope that our work provides pointers for anyone working on mouse models or other model systems to design their studies on regulatory controls, and that the generalizability of our findings in the THP-1 cell line to other systems will eventually emerge.

      (2) It would be helpful for the authors to provide an expanded explanation of the network mining approach used, as well as the cluster stability analysis and the Epitracer analysis. Although these approaches may be published elsewhere, readers with a non-computational background would benefit from additional descriptions. A schematic figure would also be helpful to clarify their approach.

      We have added a new schematic diagram in the Supplementary figures (S13) and detailed text in the Methods section describing the network mining analysis and EpiTracer identification in the revised manuscript.

      (3) It would be helpful for the authors to comment on whether the thirteen polarization states that they identify align with other analyses that have been performed using data collected from stimulated macrophages, or whether this is a novel finding, especially as the original paper from which the primary data are derived identified 9 clusters. More broadly, since the authors eventually return to the M1-M2 paradigm, it is unclear whether there is any functional support for a 13-state model - it is also possible that macrophages exist along a continuum of stimulation states rather than in discrete clusters. This at least merits further discussion, which could focus on different axes of polarization as discussed and shown in the original paper.

      As described in the manuscript, clustering based on the differential transcriptome profile of RF-set1, which contains 265 transcription factors (TFs), in response to 28 stimulants, resulted in 13 distinct clusters. The cluster member associations inferred from RF-set1 were similar in number and pattern to those inferred from the entire differential transcriptome (n=12,164; Fig. S2, cophenetic coefficient = 0.68; p-value = 1.25e−51). Furthermore, the inferred cluster pattern largely matched the clustering pattern previously described for the same dataset (Xue et al., 2014). Our contribution: The pattern we observed from the top-ranked epicenters in each cluster suggests that a subset of differentially expressed genes (DEGs) present in our top networks is sufficient for achieving differentiation. Our gene-regulatory models suggest that saturated (SA and PA) and unsaturated (LA, LiA, and OA) fatty acids, which were previously grouped together, mediate distinct modes of resolution and are now separated into two sub-branches. Similarly, the effects of IFNγ and sLPS, previously combined, are now distinctly resolved, aligning with known regulatory differences (Hoeksema et al., 2015; Kang et al., 2019).

      The principal takeaway from this analysis is not the exact number of clusters but rather the molecular basis it provides for the differentiation of functional states, with M1 and M2 representing two ends of the spectrum. Several other states are dispersed within the polarization spectrum, which we describe as a punctuated continuum. For our switching studies, we focused on clusters C11 (M1-like) and C2 (M2-like) due to their established functional relevance. However, future studies are required to explore the functional relevance of other clusters. We have added a discussion on this aspect as suggested.

      (4) It would be helpful to define the contribution of each component of the NCB group to M1 polarization.

      We assessed the impact of CEBPB, NFE2L2, and BCL3 on C2 (M1-like) polarization states by quantifying the expression levels of M1 and M2 markers. Our findings indicate that knocking down CEBPB led to a significant downregulation of M1 marker expression and an increase in M2 marker expression. In contrast, NFE2L2 and BCL3 knockdown resulted in decreased expression of M1 markers without a corresponding significant increase in M2 markers. These results suggest that CEBPB is crucial for the M1-to-M2 transition. We have added a note on page 22 to emphasize this better.

      (5) NRF2, CEBPb, and BCL3 all have well-described roles in macrophage polarization. To add clarity to their discussion, the authors should cite relevant literature (eg PMIDs 15465827, 27211851, and others) and discuss how their findings extend what is currently known about the contribution of these individual proteins to macrophage responses.

      The roles of NFE2L2, CEBPB and BCL3 in macrophage polarization and state transition are described in the Discussion section. The PMIDs mentioned by the reviewer have been added as well.

      (6) The effect size of NCB knockdown in the in vitro Staph aureus model shown in 4C is fairly small - bacterial killing assays typically require at least a log of difference to demonstrate a convincing effect. It would be helpful for the authors to include a positive control for this experiment (for example, STAT4) to frame the magnitude of their effect.

      We thank the reviewer for the comment; however, we would like to point out that the difference in CFU is plotted on a log<sub>10</sub> scale, as per common practice. On an absolute scale, the CFUs are therefore almost halved by the knockdown, a result that was reproduced multiple times with statistical significance (p-value <0.01). We feel this is sufficient to demonstrate that the NCB gene set by itself brings about a change in polarization and hence in the killing effect. We have used STAT4 as a control for marker measurements, as shown in Fig 3C. While carrying out the CFU assay with siSTAT4 may add information, we have proceeded to perform the infection experiments with and without the NCB knockdown, as that remains the main focus of the study.
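
      The relationship between a difference on the log<sub>10</sub> axis and a change on the absolute scale can be made concrete with a short calculation. The sketch below is illustrative only: the function names are hypothetical and no values are taken from the study's data. It shows that a halving of CFU corresponds to a shift of roughly 0.3 log<sub>10</sub> units, whereas the "one log" difference mentioned by the reviewer corresponds to a ten-fold change.

```python
import math


def fold_change_from_log10_difference(delta_log10: float) -> float:
    """Convert a difference between two log10(CFU) values into a linear fold change."""
    return 10 ** delta_log10


def log10_difference_from_fold_change(fold: float) -> float:
    """Convert a linear fold change in CFU into a difference on the log10 axis."""
    return math.log10(fold)


# A halving of CFU (fold change 0.5) expressed as a shift on the log10 axis.
halving_shift = log10_difference_from_fold_change(0.5)

# A one-log reduction expressed as a linear fold change.
one_log_fold = fold_change_from_log10_difference(-1.0)

print(f"halving = {halving_shift:.3f} log10 units")
print(f"one-log reduction = {one_log_fold:.2f}x of control")
```

      Whether a shift of this size is biologically convincing is a separate question from its statistical significance; the conversion only frames the magnitude.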

      Minor recommendations:

      (1) Is there a difference between the data represented in Figure 1A-B and Figure S1? If this is the same data, there is no need to repeat it, and Figure 1 could be composed only of the current panels C and D.

      We have removed Figure 1A and B, as they illustrate the same point as Figure S1. We have retained panels C and D and renamed them as the new Figure 1A and C. In addition, we have added a new panel, Fig 1B (in response to earlier points).

      (2) Could Figure 2B be represented in a different way? The circles do not contain any readable information about the genes, and it may be less visually overwhelming to represent this with just the large and small triangles. Perhaps the individual genes represented by the circles could be listed in a supplemental table or Excel file.

      We have provided new panels, Figure 2A and B, for the M1 and M2 clusters respectively, which show only the barcode genes along with a functional annotation. The full network is already provided in the supplementary data.

      (3) When indicating the N for all experiments performed in the figure legends, the authors should indicate whether these were technical or biological replicates.

      We thank the reviewer for the suggestion. We have indicated what N represents in all figure legends.

      (4) Fig 3B: the y-axis is confusing - it appears that normalization is actually to the untreated cells.

      Yes indeed. The normalization is with respect to the untreated cells as per standard practice. We have indicated this clearly in the legend.

      (5) The 72-hour time point in Fig S8 shows unexpected results. Could the authors explain or propose a hypothesis for why CXCL2 and IL1b abruptly decrease while iNOS and MRC1 abruptly increase?

      The purpose of this experiment was to standardize the time point of M1 polarization after S. aureus infection. To this end, we profiled the expression levels of markers at various time points. We chose the 24-hour time point for all subsequent experiments based on the significant upregulation of NCB seen in the macrophages. We believe that the 72-hour time point may show different effects because the initial immune response would have waned, leading to differences in cytokine dynamics. However, as this is not the focus of our study, we do not discuss this aspect further.

    1. eLife Assessment

      This important study substantially advances our understanding of pediatric Crohn's disease, mapping the cellular make-up of this disease and how patients respond to treatment. The evidence supporting the conclusions is compelling, with thorough bioinformatic analyses, underpinned by rigorous methodology and data integration. The work will be of broad interest to pediatric clinicians, immunologists and bioinformaticians.

    2. Reviewer #1 (Public review):

      Summary:

      Crohn's disease is a prevalent inflammatory bowel disease that often results in patient relapse post anti-TNF blockades. This study employs a multifaceted approach utilizing single-cell RNA sequencing, flow cytometry, and histological analyses to elucidate the cellular alterations in pediatric Crohn's disease patients pre and post anti-TNF treatment and comparing them with non-inflamed pediatric controls. Utilizing an innovative clustering approach, the research distinguishes distinct cellular states that signify the disease's progression and response to treatment. Notably, the study suggests that the anti-TNF treatment pushes pediatric patients towards a cellular state resembling adult patients with persistent relapse. This study's depth offers a nuanced understanding of cell states in CD progression that might forecast the disease trajectory and therapy response.

      Robust Data Integration: The authors adeptly integrate diverse data types: scRNA-seq, histological images, flow cytometry, and clinical metadata, providing a holistic view of the disease mechanism and response to treatment.

      Novel Clustering Approach: The introduction and utilization of ARBOL, a tiered clustering approach, enhances the granularity and reliability of cell type identification from scRNA-seq data.

      Clinical Relevance: By associating scRNA-seq findings with clinical metadata, the study offers potentially significant insights into the trajectory of disease severity and anti-TNF response, which might help with personalized treatment regimens.

      Treatment Dynamics: The transition of the pediatric cellular ecosystem towards an adult, more treatment-refractory state upon anti-TNF treatment is a significant finding. It would be beneficial to probe deeper into the temporal dynamics and the mechanisms underlying this transition.

      Comparative Analysis with Adult CD: The positioning of on-treatment biopsies between treatment-naïve pediCD and on-treatment adult CD is intriguing. A more in-depth exploration comparing pediatric and adult cellular ecosystems could provide valuable insights into disease evolution.

      Areas of improvement:

      (1) The legends accompanying the figures are quite concise. It would be beneficial to provide a more detailed description within the legends, incorporating specifics about the experiments conducted and a clearer representation of the data points.

      (2) Statistical significance is missing from Fig. 1c WBC count plot, Fig. 2 b-e panels. Please provide it even if it's not significant. Also, the legend should have the details of the statistical test used.

      (3) In the study, the NOA group is characterized by patients who, after thorough clinical evaluations, were deemed to exhibit milder symptoms, negating the need for anti-TNF prescriptions. This mild nature could potentially align the NOA group closer to FGID, a condition intrinsically defined by its low to non-inflammatory characteristics. Such an alignment sparks curiosity: is there a marked correlation between these two groups? A preliminary observation suggesting such a relationship can be spotted in Figure 6, particularly panels A and B. Given the prevalence of FGID among the pediatric population, it might be prudent for the authors to delve deeper into this potential overlap, as insights gained from mild-CD cases could provide valuable information for managing FGID.

      (4) Furthermore, Figure 7 employs multi-dimensional immunofluorescence to compare CD, encompassing all its subtypes, with FGID. If the data permits, subdividing CD into PR, FR, and NOA for this comparison could offer a more nuanced understanding of the disease spectrum. Such a granular perspective is invaluable for clinical assessments. The key question then remains: do the sample categorizations for the immunofluorescence study accommodate this proposed stratification?

      (5) The study's most captivating revelation is the proximity of anti-TNF treated pediatric CD (pediCD) biopsies to adult treatment-refractory CD. Such an observation naturally raises the question: How does this alignment compare to a standard adult colon, and what proportion of this similarity is genuinely disease-specific versus reflective of an adult state? To what degree does the similarity highlight disease-specific traits?

      Delving deeper, it will be of interest to see whether anti-TNF treatment is nudging the transcriptional state of the cells towards a more mature adult stage or veering them into a treatment-resistant trajectory. If anti-TNF therapy is indeed steering cells toward a more adult-like state, it might signify a natural maturation process; however, if it's directing them toward a treatment-refractory state, the long-term therapeutic strategies for pediatric patients might need reconsideration.

      Comments on revisions:

      I have no further comments. I am satisfied with the revisions.

    3. Reviewer #2 (Public review):

      Summary:

      Through this study the authors combine a number of innovative technologies including scRNAseq to provide insight into Crohn's disease. Importantly, samples from pediatric patients are included. The authors develop a principled and unbiased tiered clustering approach, termed ARBOL. Through high-resolution scRNAseq analysis the authors identify differences in cell subsets and states during pediCD relative to FGID. The authors provide histology data demonstrating T cell localisation within the epithelium. Importantly, the authors find anti-TNF treatment pushes the pediatric cellular ecosystem towards an adult state.

      Strengths:

      This study is well presented. The introduction clearly explains the important knowledge gaps in the field, the importance of this research, the samples that are used and study design. The results clearly explain the data, without overstating any findings. The data is well presented. The discussion expands on key findings and any limitations to the study are clearly explained.

      I think the biological findings from, and bioinformatic approach used in, this study will be of interest to many and significantly add to the field.

      Weaknesses:

      (1) The ARBOL approach for iterative tiered clustering on a specific disease condition was demonstrated to work very well on the datasets generated in this study where there were no obvious batch effects across patients. What if strong batch effects are present across donors where PCA fails to mitigate such effects? Are there any batch correction tools implemented in ARBOL for such cases?

      The authors have addressed this comment during review.

      (2) The authors mentioned that the clustering tree from the recursive sub-clustering contained too much noise, and they therefore used another approach to build a hierarchical clustering tree for the bottom-level clusters based on unified gene space. But in general, how consistent are these two trees?

      The authors have addressed this comment during review.

      Comments on revisions:

      I have no additional comments. The authors addressed my previous comments well.

    4. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      Crohn's disease is a prevalent inflammatory bowel disease that often results in patient relapse post anti-TNF blockades. This study employs a multifaceted approach utilizing single-cell RNA sequencing, flow cytometry, and histological analyses to elucidate the cellular alterations in pediatric Crohn's disease patients pre and post-anti-TNF treatment and comparing them with non-inflamed pediatric controls. Utilizing an innovative clustering approach, the research distinguishes distinct cellular states that signify the disease's progression and response to treatment. Notably, the study suggests that the anti-TNF treatment pushes pediatric patients towards a cellular state resembling adult patients with persistent relapses. This study's depth offers a nuanced understanding of cell states in CD progression that might forecast the disease trajectory and therapy response.

      Robust Data Integration: The authors adeptly integrate diverse data types: scRNA-seq, histological images, flow cytometry, and clinical metadata, providing a holistic view of the disease mechanism and response to treatment.

      Novel Clustering Approach: The introduction and utilization of ARBOL, a tiered clustering approach, enhances the granularity and reliability of cell type identification from scRNA-seq data.

      Clinical Relevance: By associating scRNA-seq findings with clinical metadata, the study offers potentially significant insights into the trajectory of disease severity and anti-TNF response, which might help with personalized treatment regimens.

      Treatment Dynamics: The transition of the pediatric cellular ecosystem towards an adult, more treatment-refractory state upon anti-TNF treatment is a significant finding. It would be beneficial to probe deeper into the temporal dynamics and the mechanisms underlying this transition.

      Comparative Analysis with Adult CD: The positioning of on-treatment biopsies between treatment-naïve pediCD and on-treatment adult CD is intriguing. A more in-depth exploration comparing pediatric and adult cellular ecosystems could provide valuable insights into disease evolution.

      Areas of improvement:

      (1) The legends accompanying the figures are quite concise. It would be beneficial to provide a more detailed description within the legends, incorporating specifics about the experiments conducted and a clearer representation of the data points. 

      We agree that it is beneficial to have descriptive figure legends that balance elements of experimental design, methodology, and the statistical analyses employed, so that the figures can be clearly understood throughout the manuscript. We have gone through the legends and clarified them throughout.

      (2) Statistical significance is missing from Fig. 1c WBC count plot, Fig. 2 b-e panels. Please provide it even if it's not significant. Also, the legend should have the details of stat test used.

      We have now added details of statistical significance to the Figure 1 legends. Please note that the Mann-Whitney U-test was used for clinical categorical data.

      (3) In the study, the NOA group is characterized by patients who, after thorough clinical evaluations, were deemed to exhibit milder symptoms, negating the need for anti-TNF prescriptions. This mild nature could potentially align the NOA group closer to FGID, a condition intrinsically defined by its low to non-inflammatory characteristics. Such an alignment sparks curiosity: is there a marked correlation between these two groups? A preliminary observation suggesting such a relationship can be spotted in Figure 6, particularly panels A and B. Given the prevalence of FGID among the pediatric population, it might be prudent for the authors to delve deeper into this potential overlap, as insights gained from mild-CD cases could provide valuable information for managing FGID.

      Thank you for this insightful point. On histopathology and endoscopy, the NOA group exhibited microscopic and macroscopic inflammation, which led to the CD diagnosis for these patients, albeit mild on both counts. By contrast, the FGID group by definition does not show inflammation on microscopic or macroscopic evaluation. There is great interest in the field of adult and pediatric gastroenterology in understanding why patients develop symptoms without evidence of inflammation. However, as of 2023, the diagnostic tools of endoscopy with biopsy and histopathology are not sensitive enough to detect transcript-level inflammation, positioning single-cell technology to reveal further information in both disease processes.

      Based on the reviewer’s suggestions, we have calculated a heatmap of overlapping NOA and FGID cell states along the Figure 6a joint-PC1, showing where NOA CD patients and FGID patients overlap in terms of cell states. This is displayed in Supplemental Figure 15d. This revealed a set of T, Myeloid, and Epithelial cell states that were most important in describing variance along the FGID-CD axis, allowing us to home in on similarities at the boundary between FGID and CD. By comparing the joint cell states with CD atlas curated cluster names, we identified CCR7-expressing T cell states and GSTA2-expressing epithelial states associated with this overlap.

      (4) Furthermore, Figure 7 employs multi-dimensional immunofluorescence to compare CD, encompassing all its subtypes, with FGID. If the data permits, subdividing CD into PR, FR, and NOA for this comparison could offer a more nuanced understanding of the disease spectrum. Such a granular perspective is invaluable for clinical assessments. The key question then remains: do the sample categorizations for the immunofluorescence study accommodate this proposed stratification?

      Thank you for the thoughtful discussion. We agree that stratifying Crohn’s disease by PR, FR, and NOA would provide valuable clinical insight. Unfortunately, our multiplex IF cohort was designed to maximize overall CD versus FGID comparisons and does not contain enough samples in the patient subgroups to power such an analysis. We have highlighted this limitation in the text.

      (5) The study's most captivating revelation is the proximity of anti-TNF-treated pediatric CD (pediCD) biopsies to adult treatment-refractory CD. Such an observation naturally raises the question: How does this alignment compare to a standard adult colon, and what proportion of this similarity is genuinely disease-specific versus reflective of an adult state? To what degree does the similarity highlight disease-specific traits?

      Delving deeper, it will be of interest to see whether anti-TNF treatment is nudging the transcriptional state of the cells towards a more mature adult stage or veering them into a treatment-resistant trajectory. If anti-TNF therapy is indeed steering cells toward a more adult-like state, it might signify a natural maturation process; however, if it's directing them toward a treatment-refractory state, the long-term therapeutic strategies for pediatric patients might need reconsideration.

      We thank the reviewer for another insightful point. We agree that age-matched samples are critical for evaluating disease cell states, and hence we have age-matched controls in our pediatric cohort. Our follow-up timeline spans only 3 years, so patients remain in the pediatric age range at the times of follow-up endoscopy and biopsy, and their samples would not be reflective of an adult GI state. We believe that the cellular behavior from treatment-naïve biopsy to on-treatment biopsy is reflective of disease state rather than movement towards an adult-like state. We would also like to point out that pediatric-onset IBD (Crohn’s and ulcerative colitis) has traditionally been harder to treat and presents with a more extensive disease state (PMID: 22643596), so the ability to detect the need for therapy escalation or change would be an invaluable tool for clinicians.

      We share the reviewer’s interest in disentangling a natural maturation process from disease- and treatment-specific changes. Because the patients who were not given treatment did not move towards the adult-like phenotype, the shift could point to a push towards a treatment-resistant trajectory. To further support these findings, we generated a new disease-pseudotime figure, Supplemental Figure 17, using cross-validation methods and the TradeSeq package. This figure was designed to track how each pediatric sample shifts from the treatment-naïve state through anti-TNF therapy and to test the robustness of these shifts across samples. The new visualizations show patterns that do not recapitulate natural aging processes but rather shifts across all cell types associated with anti-TNF treatment.

      Reviewer #2 (Public Review):

      Summary:

      Through this study, the authors combine a number of innovative technologies including scRNAseq to provide insight into Crohn's disease. Importantly, samples from pediatric patients are included. The authors develop a principled and unbiased tiered clustering approach, termed ARBOL. Through high-resolution scRNAseq analysis the authors identify differences in cell subsets and states during pediCD relative to FGID. The authors provide histology data demonstrating T cell localisation within the epithelium. Importantly, the authors find anti-TNF treatment pushes the pediatric cellular ecosystem toward an adult state.

      Strengths:

      This study is well presented. The introduction clearly explains the important knowledge gaps in the field, the importance of this research, the samples that are used, and study design.

      The results clearly explain the data, without overstating any findings. The data is well presented. The discussion expands on key findings and any limitations to the study are clearly explained.

      I think the biological findings from, and bioinformatic approach used in this study, will be of interest to many and significantly add to the field.

      Weaknesses:

      (1) The ARBOL approach for iterative tiered clustering on a specific disease condition was demonstrated to work very well on the datasets generated in this study where there were no obvious batch effects across patients. What if strong batch effects are present across donors where PCA fails to mitigate such effects? Are there any batch correction tools implemented in ARBOL for such cases?

      We thank the reviewer for their insightful point; the full extent to which ARBOL can address batch effects requires further study. To this end, we integrated Harmony into the ARBOL architecture and used it in the paper to integrate a previous study with the data presented (Figure 8). We have added instructions to ARBOL’s GitHub README on how to use Harmony with the automated clustering method. With ARBOL, as with traditional clustering methods, batch effects can cause artifactual clustering at any tier of clustering. Because the method is iterative, batch effects can surface in a single round of clustering, followed by further rounds of clustering that appear highly similar within each batch subset. Harmony addresses this issue, removing these batch-related clustering rounds. The later arrangement of fine-grained clusters using the bottom-up approach can use the batch-corrected latent space to calculate relationships between cell states, removing the effects from both sides of the algorithm. As stated, the extent to which ARBOL can be used to systematically address these batch effects requires further research, but the algorithmic architecture of ARBOL is well suited to address these effects.

      (2) The authors mentioned that the clustering tree from the recursive sub-clustering contained too much noise, and they therefore used another approach to build a hierarchical clustering tree for the bottom-level clusters based on unified gene space. But in general, how consistent are these two trees?

      Thank you for this thoughtful question. The two tree methodologies are not consistent due to their algorithmic differences, but both are important for several reasons: 

      (1) The clustering tree is top-down, meaning low-resolution lineage-related clusters are calculated first. Doublets and quality differences can cause very small clusters of different lineages (endothelial vs fibroblast) to fall under the incorrect lineage at first in the sub-clustering tree, but these are recaptured during further sub-clustering rounds, and then disentangled by the cluster-centroid tree.

      (2) The hierarchical tree is a rose tree, meaning each branching point can contain several daughter branches, while taxonomies based on distances between species (or cell types in this case) are binary trees with only 2 branches per branching point, because distances between each cluster are unique. Because this taxonomic, or bottom-up, approach differs from the top-down approach, it is useful to then look at how these bottom-level clusters are similar. To that end, we performed pair-wise differential expression between all end clusters and clustered based on those genes.

      (3) Calculation of a binary tree represents a quantitative basis for comparing the transcriptomic distance between clusters as opposed to relying on distances calculated within a heuristic manifold such as UMAP or algorithmic similarity space such as cluster definitions based on KNN graphs.

      In practice, this dual view rescues small clusters that may have been mis-grouped by technical artifacts and gives a quantitative, distance-based hierarchy that can be compared across metadata covariates.
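
      The structural difference between the two trees described above can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not ARBOL's implementation: the class `RoseNode`, the function `binary_tree_from_centroids`, the cluster names, and the two-dimensional centroids are all hypothetical, and simple centroid-distance merging stands in for the pair-wise differential expression step described in the response.

```python
from dataclasses import dataclass, field
from itertools import combinations
import math


@dataclass
class RoseNode:
    """Top-down subclustering tree: a node may have any number of children."""
    name: str
    children: list["RoseNode"] = field(default_factory=list)

    def leaves(self) -> list[str]:
        """Return the names of the bottom-level clusters under this node."""
        if not self.children:
            return [self.name]
        return [leaf for child in self.children for leaf in child.leaves()]


def binary_tree_from_centroids(centroids: dict[str, tuple[float, ...]]):
    """Bottom-up tree: repeatedly merge the two closest groups.

    Returns nested 2-tuples, so every internal node has exactly two children.
    """
    # Each active group is (subtree, centroid, number of leaf clusters).
    groups = [(name, vec, 1) for name, vec in centroids.items()]
    while len(groups) > 1:
        # Find the pair of groups whose centroids are closest.
        i, j = min(
            combinations(range(len(groups)), 2),
            key=lambda ij: math.dist(groups[ij[0]][1], groups[ij[1]][1]),
        )
        (ta, ca, na), (tb, cb, nb) = groups[i], groups[j]
        # Size-weighted mean of the two centroids.
        merged_centroid = tuple(
            (a * na + b * nb) / (na + nb) for a, b in zip(ca, cb)
        )
        merged = ((ta, tb), merged_centroid, na + nb)
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return groups[0][0]


# Hypothetical top-down result: "T.c" was placed under the T cell lineage.
rose = RoseNode("root", [
    RoseNode("T cells", [RoseNode("T.a"), RoseNode("T.b"), RoseNode("T.c")]),
    RoseNode("Myeloid", [RoseNode("M.a"), RoseNode("M.b")]),
])

# Hypothetical centroids in a shared gene space: "T.c" sits close to "M.a".
centroids = {
    "T.a": (0.0, 0.0),
    "T.b": (0.0, 1.0),
    "T.c": (5.0, 5.0),
    "M.a": (5.0, 6.5),
    "M.b": (10.0, 0.0),
}
tree = binary_tree_from_centroids(centroids)
print(rose.leaves())
print(tree)
```

      In this toy input, the bottom-up tree pairs "T.c" with "M.a" even though the top-down tree filed it under the T cell branch, which is the kind of regrouping the dual view is meant to expose.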

    1. eLife Assessment

      This important study provides solid evidence to support the anti-tumor potential of citalopram, originally an anti-depression drug, in hepatocellular carcinoma (HCC). In addition to their previous report on directly targeting tumor cells via glucose transporter 1 (GLUT1), the authors tried to uncover additional working mechanisms of citalopram in HCC treatment in the current study. The data here suggests that citalopram may regulate the phagocytotic function of TAM via C5aR1 or CD8+T cell function to suppress HCC growth in vivo.

    2. Reviewer #1 (Public review):

      Summary:

      In their previous publication (Dong et al. Cell Reports 2024), the authors showed that citalopram treatment resulted in reduced tumor size by binding to the E380 site of GLUT1 and inhibiting the glycolytic metabolism of HCC cells, instead of the classical citalopram receptor. Given that C5aR1 was also identified as a potential receptor of citalopram in the previous report, the authors focused on exploring the potential immune-dependent anti-tumor effect of citalopram via C5aR1. C5aR1 was found to be expressed on tumor-associated macrophages (TAMs) and citalopram administration showed potential to improve the stability of C5aR1 in vitro. Through macrophage depletion and adoptive transfer approaches in HCC mouse models, the data demonstrated the potential importance of C5aR1-expressing macrophages in the anti-tumor effect of citalopram in vivo. Mechanistically, their in vitro data suggested that citalopram may regulate the phagocytosis potential and polarization of macrophages through C5aR1. Next, they tried to investigate the direct link between citalopram and CD8+T cells by including an additional MASH-associated HCC mouse model. Their data suggest that citalopram may upregulate the glycolytic metabolism of CD8+T cells, probably via GLUT3 but not GLUT1-mediated glucose uptake. Lastly, as the systemic 5-HT level is down-regulated by citalopram, the authors analyzed the association between a low 5-HT level and superior CD8+T cell function against tumors. Although the data is informative, the rationale for working on additional mechanisms and the logical links among different parts are not clear. In addition, some of the conclusions are also not fully supported by the current data.

      Strengths:

      The idea of repurposing drugs already in clinical use shows great potential for immediate clinical translation. The data here suggest that the anti-depression drug citalopram plays an immune regulatory role on TAMs via a new target, C5aR1, in HCC.

      Comments on revised version:

      The authors have addressed most of my concerns about the paper.

    3. Reviewer #2 (Public review):

      Summary:

      Dong et al. present a thorough investigation into the potential of repurposing citalopram, an SSRI, for hepatocellular carcinoma (HCC) therapy. The study highlights the dual mechanisms by which citalopram exerts anti-tumor effects: reprogramming tumor-associated macrophages (TAMs) toward an anti-tumor phenotype via C5aR1 modulation and suppressing cancer cell metabolism through GLUT1 inhibition, while enhancing CD8+ T cell activation. The findings emphasize the potential of drug repurposing strategies and position C5aR1 as a promising immunotherapeutic target.

      Strengths:

      It provides detailed evidence of citalopram's non-canonical action on C5aR1, demonstrating its ability to modulate macrophage behavior and enhance CD8+ T cell cytotoxicity. The use of DARTS assays, in silico docking, and gene signature network analyses offers robust validation of drug-target interactions. Additionally, the dual focus on immune cell reprogramming and metabolic suppression presents a comprehensive strategy for HCC therapy. By highlighting the potential for existing drugs like citalopram to be repurposed, the study also emphasizes the feasibility of translational applications. During revision, the authors experimentally demonstrated that TAM has lower GLUT1, which further strengthens their claim of C5aR1 modulation-dependent TAM improvement for tumor therapy.

      Weaknesses:

      The authors proposed that CD8+ T cells have a TAM-independent role upon citalopram treatment. However, this claim requires further investigation to confirm that the effect is truly "TAM independent".

    4. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary:

      In their previous publication (Dong et al. Cell Reports 2024), the authors showed that citalopram treatment resulted in reduced tumor size by binding to the E380 site of GLUT1 and inhibiting the glycolytic metabolism of HCC cells, instead of the classical citalopram receptor. Given that C5aR1 was also identified as a potential receptor of citalopram in the previous report, the authors focused on exploring the potential immune-dependent anti-tumor effect of citalopram via C5aR1. C5aR1 was found to be expressed on tumor-associated macrophages (TAMs) and citalopram administration showed potential to improve the stability of C5aR1 in vitro. Through macrophage depletion and adoptive transfer approaches in HCC mouse models, the data demonstrated the potential importance of C5aR1-expressing macrophages in the anti-tumor effect of citalopram in vivo. Mechanistically, their in vitro data suggested that citalopram may regulate the phagocytosis potential and polarization of macrophages through C5aR1. Next, they tried to investigate the direct link between citalopram and CD8+T cells by including an additional MASH-associated HCC mouse model. Their data suggest that citalopram may upregulate the glycolytic metabolism of CD8+T cells, probably via GLUT3 but not GLUT1-mediated glucose uptake. Lastly, as the systemic 5-HT level is down-regulated by citalopram, the authors analyzed the association between a low 5-HT level and superior CD8+T cell function against tumors. Although the data is informative, the rationale for working on additional mechanisms and the logical links among different parts are not clear. In addition, some of the conclusions are also not fully supported by the current data.

      We thank the reviewer for their comprehensive summary of our study and appreciate the valuable feedback. We have made improvements based on these comments, and a detailed response addressing each point is presented below.

      Strengths: 

      The idea of repurposing drugs already in clinical use shows great potential for immediate clinical translation. The data here suggest that the antidepressant citalopram plays an immune regulatory role on TAMs via a new target, C5aR1, in HCC.

      We thank the reviewer for recognizing the strengths of our study.

      Weaknesses: 

      (1) The authors concluded that citalopram had a 'potential immune-dependent effect' based on the tumor weight difference between Rag-/- and C57 mice in Figure 1. However, tumor weight differences may also be attributed to a non-immune regulatory pathway. In addition, how do the authors calculate relative tumor weight? What is the rationale for using relative one but not absolute tumor weight to reflect the anti-tumor effect? 

      We appreciate your insights into the potential contributions of non-immune regulatory pathways to the observed tumor weight differences between Rag1<sup>-/-</sup> and wild-type C57BL/6 mice. Indeed, the anti-tumor effects of citalopram involve non-immune mechanisms. Previously, we have demonstrated the direct effects of citalopram on cancer cell proliferation, apoptosis, and metabolic processes (PMID: 39388353). In this study, we focused on immune-dependent mechanisms, utilizing Rag1<sup>-/-</sup> mice to investigate a potential immune-mediated effect. The relative tumor weight was calculated by assigning an arbitrary value of 1 to the Rag1<sup>-/-</sup> mice in the DMSO treatment group, with all other tumor weights expressed relative to this baseline. As suggested, we have included absolute tumor weight data in the revised Figure 1B, 1E, 1F, and 3B.
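
      The normalization described above can be sketched as follows. This is only a minimal illustration, assuming the baseline of 1 corresponds to the mean tumor weight of the reference group (Rag1<sup>-/-</sup> mice, DMSO); the response does not state whether a mean or another summary was used. The function name, group labels, and weights are hypothetical placeholders, not data from the study.

```python
from statistics import mean


def relative_weights(weights_by_group, reference_group):
    """Express every tumor weight relative to the mean of the reference group.

    Assumption: the baseline value of 1 is the reference group's mean weight.
    """
    baseline = mean(weights_by_group[reference_group])
    return {
        group: [w / baseline for w in weights]
        for group, weights in weights_by_group.items()
    }


# Hypothetical tumor weights in grams (placeholders, not study data).
example = {
    "Rag1_KO_DMSO": [0.8, 1.0, 1.2],
    "Rag1_KO_citalopram": [0.5, 0.6, 0.7],
}
normalized = relative_weights(example, "Rag1_KO_DMSO")
```

      Under this assumption the reference group averages exactly 1 after normalization, so a relative value carries no information about absolute tumor burden, which is why reporting absolute weights alongside is useful.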

      (2) The authors used shSlc6a4 tumor cell lines to demonstrate that citalopram's effects are independent of the conventional SERT receptor (Figure 1C-F). However, this does not entirely exclude the possibility that SERT may still play a role in this context, as it can be expressed in other cells within the tumor microenvironment. What is the expression profiling of Slc6a4 in the HCC tumor microenvironment? In addition, in Figure 1F, the tumor growth of shSlc6a4 in C57 mice displayed a decreased trend, suggesting a possible role of Slc6a4. 

      As suggested, we probed the expression pattern of SERT in HCC and its tumor microenvironment. Using a single-cell sequencing dataset of HCC (GSE125449), we found that SERT is also expressed by T cells, tumor-associated endothelial cells, and cancer-associated fibroblasts (see revised Figure S2G). Therefore, we cannot fully rule out the possibility that citalopram may influence these cellular components within the TME and contribute to its therapeutic effects. In the revised manuscript, we have included and discussed this result. In Figure 1F, SERT knockdown led to a 9% reduction in tumor growth; however, this difference was not statistically significant (0.619 ± 0.099 g vs. 0.594 ± 0.129 g; p = 0.75).

      (3) Why did the authors choose to study phagocytosis in Figures 3G-H? As an important player, TAM regulates tumor growth via various mechanisms. 

      We chose to investigate phagocytosis because citalopram targets C5aR1-expressing TAMs. C5aR1 is a receptor for the complement component C5a, which plays a crucial role in mediating phagocytosis in macrophages. In the revised manuscript, we have highlighted this rationale.

      (4) The information on unchanged deposition of C5a has been mentioned in this manuscript (Figures 3D and 3F), the authors should explain further in the manuscript, for example, C5a could bind to receptors other than C5aR1 and/or C5a bind to C5aR1 by different docking anchors compared with citalopram.

      Thank you for your insightful comment. In Figure 3D, tumor growth was attenuated in C5ar1<sup>-/-</sup> recipients compared with C5ar1<sup>+/-</sup> recipients, whereas C5a deposition remained unchanged. This suggests that while C5a is still present, its interaction with C5aR1 is critical for influencing tumor growth dynamics. In Figure 3F, C5a deposition was not affected by citalopram treatment. Indeed, docking analysis and DARTS assay revealed that citalopram binds to the D282 site of C5aR1. A previous report has shown that mutations at E199 and D282 reduce C5a binding affinity to C5aR1 (PMID: 37169960). Therefore, the impact of citalopram is primarily on C5a/C5aR1 interactions and downstream signaling pathways, rather than on altering C5a levels. In the revised manuscript, we have included this interpretation.

      (5) Figure 3I-M - the flow cytometry data suggested that citalopram treatment altered the proportions of total TAM, M1 and M2 subsets, CD4<sup>+</sup> and CD8<sup>+</sup>T cells, DCs, and B cells. Why does the author conclude that the enhanced phagocytosis of TAM was one of the major mechanisms of citalopram? As the overall TAM number was regulated, the contribution of phagocytosis to tumor growth may be limited. 

      We thank the reviewer for the valuable input. Indeed, recent studies have demonstrated that targeting C5aR1<sup>+</sup> TAMs can induce many anti-tumor effects, such as macrophage polarization and CD8<sup>+</sup> T cell infiltration (PMID: 30300579, PMID: 38331868, and PMID: 38098230). In the revised manuscript, we have clarified our conclusion to better articulate the relationship between citalopram treatment, TAM populations, and their phagocytic activity, with particular emphasis on the role of CD8<sup>+</sup> T cells. For macrophage phagocytosis, one possible explanation is that citalopram targets C5aR1 to enhance macrophage phagocytosis and subsequent antigen presentation and/or cytokine production, which promotes T cell recruitment and activity and modulates other aspects of tumor immunity. Given that the anti-tumor effects of citalopram are largely dependent on CD8<sup>+</sup> T cells, we conclude that CD8<sup>+</sup> T cells are essential for the effector mechanisms of citalopram.

      (6) Figure 4 - what is the rationale for using the MASH-associated HCC mouse model to study metabolic regulation in CD8<sup>+</sup> T cells? The tumor microenvironment and tumor growth would be quite different. In addition, how does this part link up with the mechanisms related to C5aR1 and TAM? The authors also brought GLUT1 back in the last part and focused on CD8<sup>+</sup> T cell metabolism, which was totally separated from previous data. 

      We chose the MASH-associated HCC mouse model because it closely mimics the etiology of metabolic-associated fatty liver disease (MAFLD), which is a significant contributor to the development of cirrhosis and HCC. In addition to the MASH-associated HCC mouse model, the study also incorporated the orthotopic Hepa1-6 tumor model. In our previous publication (Dong et al., Cell Reports 2024), we employed both of these HCC models. Therefore, we utilized the same two mouse models in this study. The inclusion of CD8<sup>+</sup> T cells in our study is based on the understanding that citalopram targets GLUT1, which plays a crucial role in glucose uptake (PMID: 39388353). CD8<sup>+</sup>T cell function is heavily reliant on glycolytic metabolism, making it essential to investigate how citalopram’s effects on GLUT1 influence the metabolic pathways and functionality of these immune cells. In this study, we identified that the primary glucose transporter in CD8<sup>+</sup> T cells is GLUT3, rather than GLUT1. The data presented in Figure 4 aim to illustrate the additional effect of citalopram on peripheral 5-HT levels, which, in turn, influences CD8<sup>+</sup> T cell functionality. By linking these findings, we clarify how citalopram impacts both TAMs and CD8<sup>+</sup> T cells. CD8<sup>+</sup> T cells can be influenced by citalopram through various mechanisms, including TAM-dependent mechanisms, reduced systemic serum 5-HT concentrations, and unidentified direct effects. In the revised manuscript, we have enhanced the background information to avoid any gaps.

      (7) Figure 5, the authors illustrated their mechanism that citalopram regulates CD8<sup>+</sup> T cell anti-tumor immunity through proinflammatory TAM with no experimental evidence. Using only CD206 and MHCII to represent TAM subsets obviously is not sufficient. 

      Thank you for your valuable comments. As noted by the reviewer, TAMs can influence CD8<sup>+</sup> T cell anti-tumor immunity through various mechanisms. In this study, we focused on elucidating the impact of citalopram on pro-inflammatory TAMs, which in turn affect CD8<sup>+</sup> T cell anti-tumor immunity and ultimately influence tumor outcomes. Therefore, in the mechanistic diagram, we highlighted the effect of citalopram on pro-inflammatory TAMs, while the causal relationship between TAMs and CD8<sup>+</sup> T cell anti-tumor immunity was indicated with a dotted line due to the limited evidence presented in this study. Additionally, we have expanded our discussion on how citalopram regulates CD8<sup>+</sup> T cell anti-tumor immunity through pro-inflammatory TAMs.

      For the analysis of TAMs, we initially sorted CD45<sup>+</sup>F4/80<sup>+</sup>CD11b<sup>+</sup> cells and assessed M1/M2 polarization by measuring CD206 and MHCII expression. As an added strength, we isolated TAMs from the orthotopic GLUT1<sup>KD</sup> Hepa1-6 model using CD11b microbeads and conducted real-time qPCR analysis of M1-oriented (Il6, Ifnb1, and Nos2) and M2-oriented (Mrc1, Il10, and Arg1) markers. Consistent with our flow cytometry data, the qPCR results confirmed that citalopram induces a pro-inflammatory TAM phenotype (revised Figure S9A).

      Reviewer #2 (Public review):

      Summary:

      Dong et al. present a thorough investigation into the potential of repurposing citalopram, an SSRI, for hepatocellular carcinoma (HCC) therapy. The study highlights the dual mechanisms by which citalopram exerts anti-tumor effects: reprogramming tumor-associated macrophages (TAMs) toward an anti-tumor phenotype via C5aR1 modulation and suppressing cancer cell metabolism through GLUT1 inhibition while enhancing CD8+ T cell activation. The findings emphasize the potential of drug repurposing strategies and position C5aR1 as a promising immunotherapeutic target. However, certain aspects of experimental design and clinical relevance could be further developed to strengthen the study's impact. 

      We thank the reviewer for the thoughtful review and constructive feedback. As suggested, we have made improvements based on the feedback provided.

      Strength: 

      It provides detailed evidence of citalopram's non-canonical action on C5aR1, demonstrating its ability to modulate macrophage behavior and enhance CD8+ T cell cytotoxicity. The use of DARTS assays, in silico docking, and gene signature network analyses offers robust validation of drug-target interactions. Additionally, the dual focus on immune cell reprogramming and metabolic suppression presents a thorough strategy for HCC therapy. By emphasizing the potential for existing drugs like citalopram to be repurposed, the study also underscores the feasibility of translational applications. 

      We sincerely appreciate the reviewer’s recognition of the detailed evidence supporting citalopram’s non-canonical action on C5aR1, along with the innovative methodologies employed and the promising potential for repurposing existing drugs in HCC therapy.

      Major weaknesses/suggestions: 

      The dataset and signature database used for GSEA analyses are not clearly specified, limiting reproducibility. The manuscript does not fully explore the potential promiscuity of citalopram's interactions across GLUT1, C5aR1, and SERT1, which could provide a deeper understanding of binding selectivity. The absence of GLUT1 knockdown or knockout experiments in macrophages prevents a complete assessment of GLUT1's role in macrophage versus tumor cell metabolism. Furthermore, there is minimal discussion of clinical data on SSRI use in HCC patients. Incorporating survival outcomes based on SSRI treatment could strengthen the study's translational relevance. 

      By addressing these limitations, the manuscript could make an even stronger contribution to the fields of cancer immunotherapy and drug repurposing. 

      We appreciate the reviewer’s valuable suggestions. As suggested, we have included the following revisions:

      (a) GSEA analyses: For GSEA analyses, we conducted RNA sequencing (RNA-seq) analysis on HCC-LM3 cells treated with citalopram or fluvoxamine, which led to the identification of 114 differentially expressed genes (DEGs; 80 co-upregulated and 34 co-downregulated), as reported previously (PMID: 39388353). These DEGs were then utilized to create an SSRI-related gene signature. Subsequently, we analyzed RNA-seq data from liver HCC (LIHC) samples in The Cancer Genome Atlas (TCGA) cohort, comprising 371 samples, categorizing them into high and low expression groups based on the median expression levels of each candidate target gene (such as C5AR1). Finally, we performed GSEA on the grouped samples (C5AR1-high versus C5AR1-low) using the SSRI-related gene signature. In the revised manuscript, we have included this information in the “Materials and Methods” section.

      (b) Exploration of binding selectivity: We acknowledge the importance of exploring the potential promiscuity of citalopram’s interactions across GLUT1, C5aR1, and SERT1. While we cannot provide further experimental data to support this aspect, we have included the following points in the revised manuscript: 1) We emphasize the significance of exploring the relative binding affinities of citalopram to GLUT1, C5aR1, and SERT, as varying affinities could influence the drug’s overall efficacy. As highlighted in the current manuscript and our previous publication (PMID: 39388353), citalopram interacts with C5aR1 and GLUT1 through distinct binding sites and mechanisms, whereas its interaction with SERT is characterized by a more direct inhibition of serotonin binding (PMID: 27049939). To gain deeper insights into these interactions, employing techniques such as surface plasmon resonance or biolayer interferometry could provide valuable quantitative data on binding kinetics and affinities for each target. 2) We discuss how citalopram’s interactions with multiple targets may contribute to its therapeutic effects, particularly in the context of immune modulation and tumor progression. The potential for citalopram to exhibit diverse mechanisms of action through its interactions with these proteins warrants further investigation. A comprehensive understanding of these pathways could lead to the development of improved therapeutic strategies.

      (c) GLUT1 knockdown in macrophages: In the revised manuscript, we revealed that TAMs predominantly express GLUT3 but not GLUT1 (Figures S8B and S8C). GLUT1 knockdown in THP-1 cells did not significantly impact their glycolytic metabolism (Figure S8D), whereas GLUT3 knockdown led to a marked reduction in glycolysis in THP-1 cells.

      (d) Clinical data on SSRI use in HCC patients: Previously, we reported that SSRI use is associated with reduced disease progression in HCC patients (PMID: 39388353; Cell Rep. 2024 Oct 22;43(10):114818), as detailed below:

      “We determined whether SSRIs for alleviating HCC are supported by real-world data. A total of 3061 patients with liver cancer were extracted from the Swedish Cancer Register. Among them, 695 patients had been administrated with post-diagnostic SSRIs. The Kaplan-Meier survival analysis suggested that patients who utilized SSRIs exhibited a significantly improved metastasis-free survival compared to those who did not use SSRIs, with a P value of log-rank test at 0.0002. Cox regression analysis showed that SSRI use was associated with a lower risk of metastasis (HR = 0.78; 95% CI, 0.62-0.99)”.

      Reviewer #1 (Recommendations for the authors):

      (1) Add experiments to address the questions listed in the weaknesses.

      As suggested, related experiments are performed to strengthen the conclusions.

      (2) It would be appreciated to show the expression profile of SERT or employ KO mouse models to eliminate the effect of SERT.

      As suggested, analysis of a single-cell sequencing dataset of HCC (GSE125449) revealed that SERT is expressed not only in HCC cells but also in T cells, tumor-associated endothelial cells, and cancer-associated fibroblasts (Figure S2G). Consistently, SERT has been reported as an immune checkpoint restricting CD8 T cell antitumor immunity (PMID: 40403728). Furthermore, SERT KO mice (Cyagen Biosciences, S-KO-02549) were employed to investigate the effects of citalopram. However, Slc6a4 gene knockout in mice resulted in a significant decrease in 5-HT levels in the brain and a lack of cortical columnar structures. Importantly, the mice exhibited an intolerance to citalopram treatment. Therefore, we did not pursue further investigation into the effects of citalopram in SERT KO mice.

      (3) Due to the concern of specificity and animal health, it would be more direct if the authors could use, for example, C5ar1-fl/fl x Adgre1-Cre mouse models.

      Thank you for your valuable suggestion. We fully agree with your comment regarding the value of introducing C5ar1-fl/fl and Adgre1-Cre mouse models, along with the necessary experimental setups, to substantiate this point. However, in our study, the C5ar1 KO mice exhibited normal overall appearance and viability, indicating that the model is generally healthy. Furthermore, we have validated the specific role of C5aR1 in macrophages through bone marrow reconstitution experiments, reinforcing the importance of C5aR1 in these cells. Therefore, we chose the current model to balance experimental effectiveness with considerations for animal health.

      (4) For example, a GSEA or GO analysis of comparison of macrophages from C5ar1-/- or C5ar1+/- mice may point to the enriched pathway of phagocytosis in macrophages derived from C5ar1-/- rather than C5ar1+/- mice, and this information is helpful for the integrity of this work. Besides, it would be more reliable if a nucleus staining is included in Figures 3G and 3H.

      As suggested, macrophages were isolated from tumor-bearing C5ar1<sup>-/-</sup> and C5ar1<sup>+/-</sup> mice and subsequently analyzed using RNA sequencing. The Gene Set Enrichment Analysis (GSEA) revealed a significant enrichment of the phagocytosis pathway in macrophages derived from C5ar1<sup>-/-</sup> mice compared to those from C5ar1<sup>+/-</sup> mice (see revised Figure S6A). While we acknowledge that the addition of a nucleus staining would enhance reliability, we would like to point out that this style of presentation is also commonly found in articles related to phagocytosis. Furthermore, this experiment involved a significant number of experimental mice, and in accordance with the 3Rs principle for animal experiments, we did not obtain additional sorted TAMs to perform the phagocytosis assay. Thank you for your understanding.

      (5) In line 122, there is a typo, and it should be 'analysis'.

      Thank you for pointing out the typo. It has been corrected to "analysis" in the revised manuscript.

      (6) In line 217, there is no causal relationship between the contexts, and using 'as a result' may lead to misunderstanding.

      As suggested, ‘as a result’ has been removed to avoid any misunderstanding.

      (7) In line 322, please make sure if it should be HBS or PBS.

      It is PBS, and revisions have been made.

      (8) Figure S7, the calculation of cell proportions needs to use a consistent denominator.

      As suggested, we calculated cell proportions using a consistent denominator (CD45<sup>+</sup> cells).
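
      The effect of using one denominator can be illustrated with a short sketch. It assumes every population is reported as a percentage of the same parent gate (CD45<sup>+</sup> events); the function name and event counts are hypothetical placeholders, not values from Figure S7.

```python
def proportions_of_parent(counts, parent="CD45+"):
    """Return each population's count as a percentage of the parent gate."""
    total = counts[parent]
    if total <= 0:
        raise ValueError("parent population must contain at least one event")
    return {
        name: 100.0 * n / total
        for name, n in counts.items()
        if name != parent
    }


# Hypothetical flow cytometry event counts (placeholders, not study data).
example_counts = {"CD45+": 20000, "CD8+ T": 3000, "TAM": 5000}
percentages = proportions_of_parent(example_counts)
```

      Because every value shares the same denominator, the percentages for different populations can be compared directly across panels.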

      (9) Figure 4C, label error.

      Thanks for your careful review. It has been corrected to "MASH".

      Reviewer #2 (Recommendations for the authors):

      Dong et al. present compelling evidence for repurposing citalopram, a selective serotonin reuptake inhibitor (SSRI), as a potential therapeutic for hepatocellular carcinoma (HCC). While the concept of SSRI repurposing is not novel, this manuscript provides valuable insights into the drug's dual mechanisms: targeting tumor-associated macrophages (TAMs) via C5aR1 modulation and enhancing CD8+ T cell activity, alongside inhibiting cancer cell metabolism through GLUT1 suppression. The findings underscore the promise of drug repurposing strategies and identify C5aR1 as a noteworthy immunotherapeutic target. Addressing the following points will enhance the manuscript's impact and relevance to cancer immunotherapy.

      Specific Comments:

      (1) The authors identify C5aR1 on TAMs as a direct target of citalopram, independent of its classical SERT target, using drug-induced gene signature network analysis and co-immunofluorescence of CD163+ macrophages with C5aR1. The DARTS assay further supports the binding of C5aR1 to citalopram, complemented by in silico docking analysis adapted from their previous GLUT1 study. Since GLUT1 and SERT1 are transporter proteins while C5aR1 is a GPCR, these heterogeneous binding interactions suggest potential promiscuity in SSRI-target engagement.

      (a) Figure 2A: The authors identify C5aR1 as a target using GSEA but do not specify the dataset used (e.g., cancer or immune cells) or the signature database consulted. Providing this context would enhance reproducibility.

      For GSEA, we performed RNA sequencing (RNA-seq) on HCC-LM3 cells treated with citalopram or fluvoxamine and identified 114 differentially expressed genes (DEGs), which included 80 genes that were co-upregulated and 34 that were co-downregulated, as previously documented (PMID: 39388353). These DEGs were subsequently used to develop an SSRI-related gene signature. We then employed the RNA-seq data from liver hepatocellular carcinoma (LIHC) samples within The Cancer Genome Atlas (TCGA) cohort, which included 371 samples. HCC samples in the TCGA cohort were categorized into high and low expression groups based on the median expression levels of each candidate target gene, such as C5AR1. Finally, we conducted GSEA on the grouped samples (such as C5AR1-high versus C5AR1-low) using the SSRI-related gene signature. For reproducibility, detailed information has been added to the “Materials and Methods” section of the revised manuscript.
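
      The grouping step that precedes GSEA in this description can be sketched as below. This is only an illustration of the median split, assuming samples exactly at the median fall into the low group (the response does not say how ties are handled); the function name, sample identifiers, and expression values are hypothetical, and the enrichment analysis itself is not implemented here.

```python
from statistics import median


def split_by_median(expression):
    """Assign each sample to a 'high' or 'low' group by the cohort median.

    Assumption: samples exactly at the median are placed in the 'low' group.
    """
    cutoff = median(expression.values())
    return {
        sample: "high" if value > cutoff else "low"
        for sample, value in expression.items()
    }


# Hypothetical expression values for one candidate gene (not TCGA data).
example = {"s1": 2.0, "s2": 5.0, "s3": 9.0, "s4": 1.0}
groups = split_by_median(example)
```

      The resulting high and low labels would then serve as the two phenotype classes compared in the enrichment analysis against the SSRI-related gene signature.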

      (b) Figure 2F: Given citalopram's reported role in inhibiting GLUT1, a comparative discussion on the relative contributions of GLUT1 inhibition versus C5aR1 modulation in tumor suppression is warranted. Performing a DARTS assay for GLUT1 in THP-1 cells, which express high GLUT1 levels and exhibit upregulation in M1 macrophages (https://doi.org/10.1038/s41467-022-33526-z), would clarify SSRI interactions with macrophage metabolism.

      As suggested, we first investigated citalopram treatment in THP-1 cells. The results showed that the glycolytic metabolism of THP-1 cells remained largely unaffected following citalopram treatment, as evidenced by glucose uptake, lactate release, and extracellular acidification rate (ECAR) (Figure S8A). Next, we mined a single-cell sequencing dataset of HCC and found that TAMs predominantly express GLUT3 but not GLUT1 (Figure S8B). Consistently, Western blotting analysis showed higher expression of GLUT3 and minimal levels of GLUT1 in THP-1 cells (Figure S8C). It has also been well documented that GLUT1 expression increases after M1 polarization stimuli and GLUT3 expression increases after M2 stimulation in macrophages (PMID: 37721853, PMID: 36216803). GLUT1 knockdown in THP-1 cells did not significantly impact their glycolytic metabolism (Figure S8D), whereas GLUT3 knockdown led to a marked reduction in glycolysis in THP-1 cells. Based on these findings, we conclude that the effects of citalopram on macrophages are primarily mediated through targeting C5aR1 rather than GLUT1.

      (c) Figures 2H-I: A comparison of drug-protein interactions across GLUT1, C5aR1, and SERT1 would be valuable to identify potential shared or distinct binding features.

      Citalopram exhibits distinct binding characteristics across its various targets, including GLUT1, C5aR1, and its classical target, SERT. In the case of C5aR1, our in silico docking analysis identified two key binding conformations at the orthosteric site. The interactions involved significant electrostatic contacts between citalopram’s amino group and negatively charged residues like E199 and D282. Notably, D282’s accessibility and orientation towards the binding cavity suggest it plays a crucial role in citalopram binding, highlighting the importance of specific amino acid interactions at this site. For GLUT1 (PMID: 39388353), citalopram’s interaction also demonstrated notable hydrophobic contacts, particularly through the fluorophenyl group with residues V328, P385, and L325. The cyanophthalane group penetrated the substrate-binding cavity, indicating that citalopram could occupy a similar binding site as glucose, which is distinct from the binding mechanism observed in C5aR1. The involvement of E380 in both poses for GLUT1 further emphasizes the role of electrostatic interactions in mediating citalopram’s binding to this transporter. In contrast, for SERT (PMID: 27049939), citalopram locks the transporter in an outward-open conformation by occupying the central binding site, which is located between transmembrane helices 1, 3, 6, 8 and 10. This binding directly obstructs serotonin from accessing its binding site, illustrating a more definitive blockade mechanism. Additionally, the allosteric site at SERT, positioned between extracellular loops 4 and 6 and transmembrane helices 1, 6, 10, and 11, enhances this blockade by sterically hindering ligand unbinding, thus providing a clear explanation for the allosteric modulation of serotonin transport. In summary, while citalopram interacts with C5aR1 and GLUT1 through distinct binding sites and mechanisms, its interaction with SERT is characterized by a more straightforward blockade of serotonin binding. The unique structural and functional attributes of each target highlight the versatility of citalopram and suggest that its pharmacological effects may vary significantly depending on the specific protein being targeted. In the revised manuscript, we have included this detailed information.

      (2) The manuscript presents evidence that citalopram reprograms TAMs to an anti-tumor phenotype, enhancing their phagocytic capacity.

      (a) Bone Marrow Reconstitution Experiments (Figure 3): The use of donor (dC5aR1) and recipient (rC5aR1) mice is significant but requires clarification. Explicitly defining donor and recipient terminology and including a schematic of the experimental design would improve reader comprehension.

      We appreciate your valuable feedback. As suggested, the terminology for donor (dC5aR1) and recipient (rC5aR1) mice was defined: “we injected GLUT1<sup>KD</sup> Hepa1-6 cells into syngeneic recipient C5ar1<sup>-/-</sup> (rC5ar1<sup>-/-</sup> ) mice that had been reconstituted with donor C5ar1<sup>+/-</sup> (dC5ar1<sup>+/-</sup>) or C5ar1<sup>-/-</sup> (dC5ar1<sup>-/-</sup>) bone marrow (BM) cells to analyze the therapeutic effect of citalopram”. Additionally, we have included a schematic of the experimental design to enhance reader comprehension (see revised Figure 3E).

      (b) GLUT1 Knockdown (KD) Tumor Cells: While GLUT1 KD tumor cells are utilized, the authors do not assess GLUT1 KD or knockout (KO) in macrophages. Testing the effect of citalopram on macrophages with GLUT1 KO/KD would help determine the relative importance of C5aR1 versus GLUT1 in mediating SSRI effects.

      As responded above, GLUT1 knockdown in THP-1 cells did not significantly alter their glycolytic metabolism (Figure S8D). This observation can be explained by the predominant expression of GLUT3 in TAMs rather than GLUT1 (Figures S8B and S8C). Indeed, knockdown of GLUT3 led to a significant reduction in glycolysis in THP-1 cells (Figure S8C).

      (c) C5aR1's Pro-Tumoral Role: The authors state that C5aR1 fosters an immunosuppressive microenvironment but omit a discussion of current literature on C5aR1's pro-tumoral role (e.g., https://doi.org/10.1038/s41467-024-48637-y, https://www.nature.com/articles/s41419-024-06500-4, https://doi.org/10.1016/j.ymthe.2023.12.010). Including this background in both the introduction and discussion would contextualize their findings.

      Thanks for your valuable feedback. As suggested, we have revised the manuscript to include discussions on C5aR1’s pro-tumoral role, referencing the suggested studies in both the introduction and discussion sections for better context. As detailed below:

      (1) Targeting C5aR1<sup>+</sup> TAMs effectively reverses tumor progression and enhances anti-tumor response;

      (2) Targeting C5aR1 reprograms TAMs from a protumor state to an antitumor state, promoting the secretion of CXCL9 and CXCL10 while facilitating the recruitment of cytotoxic CD8<sup>+</sup> T cells;

      (3) Moreover, citalopram induces TAM phenotypic polarization toward an M1 proinflammatory state, which supports the anti-tumor immune response within the TME.

      (d) C5aR1 Expression in TAMs: Is C5aR1 expression constitutive in TAMs? Further details on C5aR1 expression dynamics in TAMs under different conditions could strengthen the discussion. Public datasets on TAMs in various states (e.g., https://www.nature.com/articles/s41586-023-06682-5, https://www.cell.com/cell/abstract/S0092-8674(19)31119-5, https://pubmed.ncbi.nlm.nih.gov/36657444/) may offer useful insights.

      Thank you for your valuable suggestions. As suggested, we investigated the expression patterns of C5aR1 in TAMs using an HCC cohort (http://cancer-pku.cn:3838/HCC/). In the study conducted by Qiming Zhang et al. (PMID: 31675496), six distinct macrophage subclusters were identified, with M4-c1-THBS1 and M4-c2-C1QA showing significant enrichment in tumor tissues. M4-c1-THBS1 was enriched with signatures indicative of myeloid-derived suppressor cells (MDSCs), while M4-c2-C1QA exhibited characteristics that resembled those of TAMs as well as M1 and M2 macrophages. Our subsequent analysis revealed that C5aR1 is highly expressed in these two clusters, while expression levels in the other macrophage clusters were notably lower (see revised Figure S3).

      (3) The manuscript shows that citalopram-induced reductions in systemic serotonin levels enhance CD8+ T cell activation and cytotoxicity, as evidenced by increased glycolytic metabolism and elevated IFN-γ, TNF-α, and GZMB expression.

      (a) How is CD8+ T cell activation achieved in serotonin-deficient environments?

      As reported (PMID: 34524861), one possible explanation is that serotonin may enhance PD-L1 expression on cancer cells, thereby impairing CD8<sup>+</sup> T cell function. A deficiency of serotonin in the tumor microenvironment can delay tumor growth by promoting the accumulation and effector functions of CD8<sup>+</sup> T cells while reducing PD-L1 expression. In addition to SERT-mediated transport and 5-HT receptor signaling, CD8<sup>+</sup> T cells can express TPH1 (PMID: 38215751, PMID: 40403728), enabling them to synthesize endogenous 5-HT, which enhances their activity through serotonylation-dependent mechanisms (PMID: 38215751). In the revised manuscript, we have incorporated these interpretations.

      (4) Suggestions for the model figure revision-C5aR1 in TAMs without Citalopram (Figure 5).

      (a) Including a control scenario depicting receptor status and function in TAMs without citalopram treatment would provide a clearer baseline for understanding citalopram's effects.

      Thank you for your valuable input regarding the model figure revision. We have included a revised mechanism model that depicts the receptor status and function of C5aR1 in TAMs without citalopram treatment, as you suggested.

      (5) Suggestions for addressing clinical relevance.

      The study predominantly uses preclinical mouse models, although some human HCC data is analyzed (Figures 2B and 3O). However, there is no discussion of clinical data on SSRI use in HCC patients.

      Incorporating an analysis of patient survival outcomes based on SSRI treatment (e.g., https://pmc.ncbi.nlm.nih.gov/articles/PMC5444756/, https://pmc.ncbi.nlm.nih.gov/articles/PMC10483320/) would enhance the translational relevance of the findings.

      Previously, we reported that the use of SSRIs is associated with reduced disease progression in HCC patients, based on real-world data from the Swedish Cancer Register (PMID: 39388353). As suggested, we have further discussed the clinical relevance of SSRIs in the revised manuscript, as detailed below:

      “In a study involving 308,938 participants with HCC, findings indicated that the use of antidepressants following an HCC diagnosis was linked to a decreased risk of both overall mortality and cancer-specific mortality (PMID: 37672269). These associations were consistently observed across various subgroups, including different classes of antidepressants and patients with comorbidities such as hepatitis B or C infections, liver cirrhosis, and alcohol use disorders. Similarly, our analysis of real-world data from the Swedish Cancer Register demonstrated that SSRIs are correlated with slower disease progression in HCC patients (PMID: 39388353). Given these insights, antidepressants, especially SSRIs, show significant potential as anticancer therapies for individuals diagnosed with HCC”.

    1. eLife Assessment

      This functional MRI study critically tests the hypothesis that poor face recognition in developmental prosopagnosia in humans is driven by reduced spatial integration and smaller receptive fields in face-selective brain regions. The evidence provided is compelling as it is well-powered, uses state-of-the-art functional brain imaging, eye tracking, and computational analyses. The observed lack of difference in population receptive field sizes between face-selective brain regions of individuals with and without prosopagnosia, though a null result, has important implications for the field, and specifically, for theories of face recognition.

    2. Reviewer #1 (Public review):

      Summary:

      The authors examine the neural correlates of face recognition deficits in individuals with Developmental Prosopagnosia (DP; 'face blindness'). Contrary to theories that poor face recognition is driven by reduced spatial integration (via smaller receptive fields), here the authors find that the properties of receptive fields in face-selective brain regions are the same in typical individuals vs. those with DP. The main analysis technique is population Receptive Field (pRF) mapping, with a wide range of measures considered. The authors report that there are no differences in goodness-of-fit (R2), the properties of the pRFs (neither size, location, nor the gain and exponent of the Compressive Spatial Summation model), nor their coverage of the visual field. The relationship of these properties to the visual field (notably the increase in pRF size with eccentricity) is also similar between the groups. Eye movements do not differ between the groups.

      Strengths:

      Although this is a null result, the large number of null results gives confidence that there are unlikely to be differences between the two groups. Together, this makes a compelling case that DP is not driven by differences in the spatial selectivity of face-selective brain regions, an important finding that directly informs theories of face recognition. The paper is well written and enjoyable to read, the studies have clearly been carefully conducted with clear justification for design decisions, and the analyses are thorough.

      Weaknesses:

      One potential issue relates to the localisation of face-selective regions in the two groups. As in most studies of the neural basis of face recognition, localisers are used to find the face-selective Regions of Interest (ROIs) - OFA, mFus, and pFus, with comparison to the scene-selective PPA. To do so, faces are contrasted against other objects to find these regions (or scenes vs. others for the PPA). The one consistent difference that does emerge between groups in the paper is in the selectivity of these regions, which are less selective for faces in DP than in typical individuals (e.g., Figure 1B), as one might expect. 6/20 prosopagnosic individuals are also missing mFus, relative to only 2/20 typical individuals. This, to me, raises the question of whether the two groups are being compared fairly. If the localised regions were smaller and/or displaced in the DPs, this might select only a subset of the neural populations typically involved in face recognition. Perhaps the difference between groups lies outside this region. In other words, it could be that the differences in prosopagnosic face recognition lie in the neurons that are not able to be localised by this approach. The authors consider in the discussion whether their DPs may not have been 'true DPs', which is convincing (p. 12). The question here is whether the regions selected are truly the 'prosopagnosic brain areas' or whether there is a kind of survivor bias (i.e., the regions selected are normal, but perhaps the difference lies in the nature/extent of the regions). At present, the only consideration given to explain the differences in prosopagnosia is that there may be 'qualitative' differences between the two (which may be true), but I would give more thought to this.

      The discussion considers the differences between the current study and an unpublished preprint (Witthoft et al, 2016), where DPs were found to have smaller pRFs than typical individuals. The discussion presents the argument that the current results are likely more robust, given the use of images within the pRF mapping stimuli here (faces, objects, etc) as opposed to checkerboards in the prior work, and the use of the CSS model here as opposed to a linear Gaussian model previously. This is convincing, but fails to address why there is a lack of difference in the control vs. DP group here. If anything, I would have imagined that the use of faces in mapping stimuli would have promoted differences between the groups (given the apparent difference in selectivity in DPs vs. controls seen here), which adds to the reliability of the present result. Greater consideration of why this should have led to a lack of difference would be ideal. The latter point about pRF models (Gaussian vs. CSS) does seem pertinent, for instance - could the 'qualitative' difference lead to changes in the shape of these pRFs in prosopagnosia that are better characterised by the CSS model, perhaps? Perhaps more straightforwardly, and related to the above, could differences in the localisation of face-selective regions have driven the difference in prior work compared to here?

      Finally, the lack of variations in the spatial properties of these brain regions is interesting in light of the theories that spatial integration is a key aspect of effective face recognition. In this context, it is interesting to note the marked drop in R2 values in face-selective regions like mFus relative to earlier cortex. The authors note in some sense that this is related to the larger receptive field size, but is there a broader point here that perhaps the receptive field model (even with Compressive Spatial Summation) is simply a poor fit for the function of these areas? Could it be that these areas are simply not spatial at all? A broader link between the null results presented here and their implications for theories of face recognition would be ideal.

    3. Reviewer #2 (Public review):

      Summary:

      This is a well-conducted and clearly written manuscript addressing the link between population receptive fields (pRFs) and visual behavior. The authors test whether developmental prosopagnosia (DP) involves atypical pRFs in face-selective regions, a hypothesis suggested by prior work with a small DP sample. Using a larger cohort of DPs and controls, robust pRF mapping with appropriate stimuli and CSS modeling, and careful in-scanner eye tracking, the authors report no group differences in pRF properties across the visual processing hierarchy. These results suggest that reduced spatial integration is unlikely to account for holistic face processing deficits in DP.

      Strengths:

      The dataset quality, sample size, and methodological rigor are notable strengths.

      Weaknesses:

      The primary concern is the interpretation of the results.

      (1) Relationship between pRFs and spatial integration

      While atypical pRF properties could contribute to deficits in spatial integration, impairments in holistic processing in DPs are not necessarily caused by pRF abnormalities. The discussion could be strengthened by considering alternative explanations for reduced spatial integration, such as altered structural or functional connectivity in the face network, which has been reported to underlie DP's difficulties in integrating facial features.

      (2) Beyond the null hypothesis testing framework

      The title claims "normal spatial integration," yet this conclusion is based on a failure to reject the null hypothesis, which does not justify accepting the alternative hypothesis. To substantiate a claim of "normal," the authors would need to provide analyses quantifying evidence for the absence of effects, e.g., using a Bayesian framework.

      (3) Face-specific or broader visual processing

      Prior work from the senior author's lab (Jiahui et al., 2018) reported pronounced reductions in scene selectivity and marginal reductions in body selectivity in DPs, suggesting that visual processing deficits in DPs may extend beyond faces. While the manuscript includes PPA as a high-level control region for scene perception, scene selectivity was not directly reported. The authors could also consider individual differences and potential data-quality confounds (tSNR difference between and within groups, several obvious outliers in the figures, etc). For instance, examining whether reduced tSNR in DPs contributed to lower face selectivity in the DP group in this dataset.

      (4) Linking pRF properties to behavior

      The manuscript aims to examine the relationship between pRF properties and behavior, but currently reports only one aspect of pRF (size) in relation to a single behavioral measure (CFMT), without full statistical reporting:

      "We found no significant association between participants' CFMT scores and mean pRF size in OFA, pFUS, or mFUS."

      For comprehensive reporting, the authors could examine additional pRF properties (e.g., center, eccentricity, scaling between eccentricity and pRF size, shape of visual field coverage, etc), additional ROIs (early, intermediate, and category-selective areas), and relate them to multiple behavioral measures (e.g., HEVA, PI20, FFT). This would provide a full picture of how pRF characteristics relate to behavioral performance in DP.

    4. Author response:

      Reviewer #1 (Public review):

      Summary:

      The authors examine the neural correlates of face recognition deficits in individuals with Developmental Prosopagnosia (DP; 'face blindness'). Contrary to theories that poor face recognition is driven by reduced spatial integration (via smaller receptive fields), here the authors find that the properties of receptive fields in face-selective brain regions are the same in typical individuals vs. those with DP. The main analysis technique is population Receptive Field (pRF) mapping, with a wide range of measures considered. The authors report that there are no differences in goodness-of-fit (R2), the properties of the pRFs (neither size, location, nor the gain and exponent of the Compressive Spatial Summation model), nor their coverage of the visual field. The relationship of these properties to the visual field (notably the increase in pRF size with eccentricity) is also similar between the groups. Eye movements do not differ between the groups.

      Strengths:

      Although this is a null result, the large number of null results gives confidence that there are unlikely to be differences between the two groups. Together, this makes a compelling case that DP is not driven by differences in the spatial selectivity of face-selective brain regions, an important finding that directly informs theories of face recognition. The paper is well written and enjoyable to read, the studies have clearly been carefully conducted with clear justification for design decisions, and the analyses are thorough.

      Weaknesses:

      One potential issue relates to the localisation of face-selective regions in the two groups. As in most studies of the neural basis of face recognition, localisers are used to find the face-selective Regions of Interest (ROIs) - OFA, mFus, and pFus, with comparison to the scene-selective PPA. To do so, faces are contrasted against other objects to find these regions (or scenes vs. others for the PPA). The one consistent difference that does emerge between groups in the paper is in the selectivity of these regions, which are less selective for faces in DP than in typical individuals (e.g., Figure 1B), as one might expect. 6/20 prosopagnosic individuals are also missing mFus, relative to only 2/20 typical individuals. This, to me, raises the question of whether the two groups are being compared fairly. If the localised regions were smaller and/or displaced in the DPs, this might select only a subset of the neural populations typically involved in face recognition. Perhaps the difference between groups lies outside this region. In other words, it could be that the differences in prosopagnosic face recognition lie in the neurons that are not able to be localised by this approach. The authors consider in the discussion whether their DPs may not have been 'true DPs', which is convincing (p. 12). The question here is whether the regions selected are truly the 'prosopagnosic brain areas' or whether there is a kind of survivor bias (i.e., the regions selected are normal, but perhaps the difference lies in the nature/extent of the regions). At present, the only consideration given to explain the differences in prosopagnosia is that there may be 'qualitative' differences between the two (which may be true), but I would give more thought to this.

      We acknowledge that face-selective ROIs in DPs, relative to controls, may be smaller, less selective, or altogether missing when traditional methods of localization with fixed thresholds are used (Furl et al., 2011). For this reason - to circumvent potential survivor bias and ensure ROI voxel counts across participants are equated - we used a method of ROI definition whereby each subject’s individual statistical map from the localizer was intersected with a generously sized group mask for each ROI and the top 20% most category-selective voxels were retained for the pRF analysis (Norman-Haignere et al., 2013; Jiahui et al., 2018). This means that the raw number of voxels per ROI was equal across all participants with respect to the common group space, thereby ensuring a fair comparison even in cases where one group shows diminished category-selectivity. The details of the ROI definition are provided in the Methods at the end of the manuscript. To ensure readers understand our approach, we will also make more explicit mention of this in the main body of the manuscript.
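      The ROI definition described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' pipeline: the function name `select_top_voxels`, the synthetic statistical maps, and the cubic group mask are all hypothetical. It shows why the voxel count is equated across participants: the number of retained voxels depends only on the size of the shared group mask, not on how selective a given participant's map is.

```python
import numpy as np

def select_top_voxels(stat_map, group_mask, top_fraction=0.2):
    """Keep the most category-selective voxels inside a group mask.

    stat_map     : array of localizer statistics for one participant (hypothetical input)
    group_mask   : boolean array of the same shape, shared by all participants
    top_fraction : fraction of in-mask voxels to retain (0.2 = top 20%)
    """
    stat_map = np.asarray(stat_map, dtype=float)
    group_mask = np.asarray(group_mask, dtype=bool)
    in_mask = np.flatnonzero(group_mask)              # flat indices of voxels in the mask
    n_keep = int(round(top_fraction * in_mask.size))  # depends on the mask size only
    # Rank in-mask voxels by selectivity, highest first, and keep the top n_keep.
    order = np.argsort(stat_map.ravel()[in_mask])[::-1]
    keep = in_mask[order[:n_keep]]
    roi = np.zeros(stat_map.size, dtype=bool)
    roi[keep] = True
    return roi.reshape(stat_map.shape)

# Synthetic example: one participant with strong selectivity, one with weak selectivity.
rng = np.random.default_rng(0)
mask = np.zeros((10, 10, 10), dtype=bool)
mask[2:7, 2:7, 2:7] = True                            # 125 voxels in the group mask
strong = rng.normal(3.0, 1.0, mask.shape)
weak = rng.normal(1.0, 1.0, mask.shape)
roi_strong = select_top_voxels(strong, mask)
roi_weak = select_top_voxels(weak, mask)
print(roi_strong.sum(), roi_weak.sum())               # both ROIs contain 25 voxels
```

      A fixed statistical threshold would instead yield fewer voxels for the weakly selective participant, which is the survivor-bias concern raised by the reviewer.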

      With regard to the question of whether face-selective ROIs may be displaced in DPs compared to controls, previous work from the senior author’s lab (Jiahui et al., 2018) shows that, despite exhibiting weaker activations, the peak coordinates of significant clusters in DPs occupy very similar locations to those of controls. And, even if there were indeed slight displacements of face-selective ROIs for some subjects, the group-defined masks used in the present analysis were large enough to capture the majority of the top voxels. In the supplemental materials section, we will include a diagram of the group masks used in our study.

      The reviewer here also points out that more DPs than controls were missing the mFUS region (6/20 DPs vs 2/20 controls; Figure 1C). However, ‘missing’ in this context was not based on face-selectivity but rather a lack of retinotopic tuning. pRFs were fit to all voxels within each ROI - with all subjects starting out with equal voxel counts - and thereafter, voxels for which the variance explained by the pRF model was below 20% were excluded from subsequent analysis. We decided that any ROI with fewer than 10 voxels remaining after thresholding on the pRF fit should be deemed ‘missing’ since we considered the amount of data insufficient to reliably characterize the region’s retinotopic profile. While it may be somewhat interesting that four more DPs than controls were ‘missing’ left mFUS, using this particular set of decision criteria, it is important to keep in mind that left mFUS was just one of six face-selective regions under study. The other five regions, many of which evinced strong fits by the pRF model, were represented comparably in DPs and controls and showed high similarity in the pRF parameters. Furthermore, across most participants, mFUS exhibited a low proportion of retinotopically modulated voxels (defined as voxels with pRF R squared greater than 20%, see Figure 1D). A follow-up analysis showed that the count of voxels surviving pRF R squared thresholding in left mFUS was not significantly correlated with mean pRF size (r(30)=0.23, t=1.28, p=0.21), indicating that the greater exclusion of DPs in this region is unlikely to have biased the group’s average pRF size.
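      The exclusion rule and the reported correlation statistic can be made concrete with a short sketch. The function names and the example R squared values below are hypothetical; only the 20% threshold, the 10-voxel minimum, and the values r = 0.23 with 30 degrees of freedom come from the text. The last function uses the standard conversion t = r * sqrt(df / (1 - r^2)).

```python
import math
import numpy as np

def surviving_voxels(r2, r2_threshold=0.20):
    """Boolean array marking voxels whose pRF variance explained exceeds the threshold."""
    return np.asarray(r2, dtype=float) > r2_threshold

def roi_is_missing(r2, r2_threshold=0.20, min_voxels=10):
    """An ROI counts as 'missing' when fewer than min_voxels survive thresholding."""
    return int(surviving_voxels(r2, r2_threshold).sum()) < min_voxels

def t_from_r(r, df):
    """t statistic for a Pearson correlation r with df degrees of freedom."""
    return r * math.sqrt(df / (1.0 - r * r))

# Hypothetical ROIs with 20 voxels each.
r2_good = np.array([0.35] * 12 + [0.05] * 8)   # 12 voxels survive
r2_poor = np.array([0.35] * 4 + [0.05] * 16)   # only 4 voxels survive
print(roi_is_missing(r2_good), roi_is_missing(r2_poor))

# Consistency check on the reported statistic.
print(round(t_from_r(0.23, 30), 2))
```

      With r rounded to 0.23 the conversion gives a t of about 1.29; the small gap to the reported t=1.28 is what one would expect if the unrounded r was slightly below 0.23.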

      The discussion considers the differences between the current study and an unpublished preprint (Witthoft et al, 2016), where DPs were found to have smaller pRFs than typical individuals. The discussion presents the argument that the current results are likely more robust, given the use of images within the pRF mapping stimuli here (faces, objects, etc) as opposed to checkerboards in the prior work, and the use of the CSS model here as opposed to a linear Gaussian model previously. This is convincing, but fails to address why there is a lack of difference in the control vs. DP group here. If anything, I would have imagined that the use of faces in mapping stimuli would have promoted differences between the groups (given the apparent difference in selectivity in DPs vs. controls seen here), which adds to the reliability of the present result. Greater consideration of why this should have led to a lack of difference would be ideal. The latter point about pRF models (Gaussian vs. CSS) does seem pertinent, for instance - could the 'qualitative' difference lead to changes in the shape of these pRFs in prosopagnosia that are better characterised by the CSS model, perhaps? Perhaps more straightforwardly, and related to the above, could differences in the localisation of face-selective regions have driven the difference in prior work compared to here?

      We agree that the use of high-level mapping stimuli (including faces) adds to the reliability of the present results for DPs and could have further emphasized differences between the groups if true differences did, in fact, exist. We speculate on the extent to which the type of mapping stimuli and various other methodological factors (e.g. stimulus size, aperture design, pRF model) could have explained the divergent findings in our study versus that of Witthoft et al. (2016) in the section of the Discussion titled, “What factors may have contributed to the different results for the present study and Witthoft et al. (2016)”. In brief, our use of more colorful, naturalistic stimuli targeting higher-level visual areas elicited better model fits than the black and white checkerboard pattern used by Witthoft et al. (2016). The CSS model we used is better suited for higher-level regions and makes fewer assumptions than the linear pRF model. The field of view of our stimulus was smaller but still relevant for real-world perception of faces. Finally, our aperture design and longer run length likely also improved reliability. Overall, these methodological improvements, along with our larger sample size, provide stronger evidence for our findings. These are our best attempts to make sense of the divergent findings, but it is not possible to come to a definitive explanation. Examples abound of exaggerated or spurious effects from small-scale studies that ultimately fail to replicate in the related field of dyslexia research (Jednorog et al., 2015; Ramus et al., 2018) and neuroimaging research more generally (Turner et al., 2018; Poldrack et al., 2017). Sometimes there are clear explanations for a lack of replicability (e.g. software bugs, overly flexible preprocessing methods, etc.), but many times the real reason cannot be determined.

      Regarding the type of pRF model deployed, our use of a non-linear exponent (versus a linear model as in the Witthoft et al. (2016) preprint) is unlikely to explain the similarity we observed between the groups in terms of pRF size. Specifically, the groups did not show substantial differences in the exponent by ROI, as seen in Figure 1E, so the use of a linear model should, in theory, produce similar outcomes for the two groups. We will mention this point in the main text.
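      For readers less familiar with the two model families, the following sketch contrasts a linear Gaussian pRF with the Compressive Spatial Summation (CSS) form, in which the summed stimulus drive is raised to an exponent. It is a simplified, assumption-laden illustration rather than the authors' fitting code: the grid size, the aperture, the function names, and the parameter values are hypothetical, and no hemodynamic model is included. The helper `effective_size` follows the convention used in the CSS literature of reporting pRF size as sigma divided by the square root of the exponent, which is assumed here.

```python
import numpy as np

def css_response(stimulus, x0, y0, sigma, gain=1.0, exponent=0.5, extent=10.0):
    """Predicted response of one pRF to a binary stimulus aperture.

    stimulus : square 2D array (1 where the stimulus is present, else 0),
               assumed to span [-extent, extent] degrees in both directions
    exponent : 1.0 gives a linear Gaussian pRF; values below 1 give
               compressive (subadditive) spatial summation
    """
    n = stimulus.shape[0]
    coords = np.linspace(-extent, extent, n)
    xx, yy = np.meshgrid(coords, coords)
    gauss = np.exp(-((xx - x0) ** 2 + (yy - y0) ** 2) / (2.0 * sigma ** 2))
    gauss /= gauss.sum()                      # unit volume, so a full-field drive is 1
    drive = float((stimulus * gauss).sum())   # linear spatial summation
    return gain * drive ** exponent           # static power-law nonlinearity

def effective_size(sigma, exponent):
    """pRF size under the assumed CSS convention: sigma / sqrt(exponent)."""
    return sigma / np.sqrt(exponent)

# Half-field aperture covering the right half of a 100 x 100 grid.
grid = np.linspace(-10.0, 10.0, 100)
half_field = (np.meshgrid(grid, grid)[0] > 0).astype(float)

linear_half = css_response(half_field, 0.0, 0.0, sigma=2.0, exponent=1.0)
css_half = css_response(half_field, 0.0, 0.0, sigma=2.0, exponent=0.5)
print(linear_half, css_half)
```

      Covering half of the pRF yields half of the full-field response under the linear model but a larger share under the compressive model. Because size estimates depend on both sigma and the exponent in this formulation, similar exponents across groups (Figure 1E) imply that the choice of model would affect both groups in a similar way.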

      Finally, the lack of variations in the spatial properties of these brain regions is interesting in light of the theories that spatial integration is a key aspect of effective face recognition. In this context, it is interesting to note the marked drop in R2 values in face-selective regions like mFus relative to earlier cortex. The authors note in some sense that this is related to the larger receptive field size, but is there a broader point here that perhaps the receptive field model (even with Compressive Spatial Summation) is simply a poor fit for the function of these areas? Could it be that these areas are simply not spatial at all? A broader link between the null results presented here and their implications for theories of face recognition would be ideal.

      The weaker pRF fits found in mFUS, to us, raise the question of whether there is a more effective pRF stimulus for these more anterior regions. For example, it might be possible to obtain higher and more reliable responses there using single isolated faces (cf. Kay, Weiner, & Grill-Spector, 2015). More broadly, though, we agree that it is important to acknowledge that the receptive field model might ultimately be a coarse and incomplete characterization of neural function in these areas. As the other reviewer suggests, one possibility is that other brain processes (e.g. functional or structural connectivity between ROIs) may give rise to holistic face processing in ways that are not captured by pRF properties.

      Reviewer #2 (Public review):

      Summary:

      This is a well-conducted and clearly written manuscript addressing the link between population receptive fields (pRFs) and visual behavior. The authors test whether developmental prosopagnosia (DP) involves atypical pRFs in face-selective regions, a hypothesis suggested by prior work with a small DP sample. Using a larger cohort of DPs and controls, robust pRF mapping with appropriate stimuli and CSS modeling, and careful in-scanner eye tracking, the authors report no group differences in pRF properties across the visual processing hierarchy. These results suggest that reduced spatial integration is unlikely to account for holistic face processing deficits in DP.

      Strengths:

      The dataset quality, sample size, and methodological rigor are notable strengths.

      Weaknesses:

      The primary concern is the interpretation of the results.

      (1) Relationship between pRFs and spatial integration

      While atypical pRF properties could contribute to deficits in spatial integration, impairments in holistic processing in DPs are not necessarily caused by pRF abnormalities. The discussion could be strengthened by considering alternative explanations for reduced spatial integration, such as altered structural or functional connectivity in the face network, which has been reported to underlie DP's difficulties in integrating facial features.

      We agree the Discussion section could benefit from mentioning that alterations to other neural mechanisms, besides pRF organization, could produce deficits in holistic processing. This could take the form of altered functional connectivity (Rosenthal et al., 2017; Lohse et al., 2016; Avidan et al., 2014) or altered structural connectivity (Gomez et al., 2015; Song et al., 2015).

      (2) Beyond the null hypothesis testing framework

      The title claims "normal spatial integration," yet this conclusion is based on a failure to reject the null hypothesis, which does not justify accepting the alternative hypothesis. To substantiate a claim of "normal," the authors would need to provide analyses quantifying evidence for the absence of effects, e.g., using a Bayesian framework.

      We acknowledge that, using frequentist statistical methods, failing to reject the null hypothesis is not sufficient to claim equivalence. For the revision, we will look into additional analyses that could quantify evidence for the null hypothesis. And we will adjust the wording of the title in this regard.
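      One common way to quantify evidence for the null, offered here only as a hedged illustration and not as the analysis the authors will necessarily adopt, is the BIC approximation to the Bayes factor for a two-group comparison. The function name `bf01_bic` and the toy data are hypothetical. The null model fits one shared mean and the alternative fits one mean per group, so the difference in BIC is n * ln(SSE_alt / SSE_null) + ln(n), and BF01 is approximated by exp(delta_BIC / 2). Values above 1 favor the null.

```python
import math
import numpy as np

def bf01_bic(group_a, group_b):
    """BIC-approximated Bayes factor in favor of 'no group difference'.

    Null model: one shared mean. Alternative model: a separate mean per group.
    """
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    pooled = np.concatenate([a, b])
    n = pooled.size
    sse_null = float(((pooled - pooled.mean()) ** 2).sum())
    sse_alt = float(((a - a.mean()) ** 2).sum() + ((b - b.mean()) ** 2).sum())
    # BIC_alt - BIC_null: fit term plus the penalty for one extra parameter.
    delta_bic = n * math.log(sse_alt / sse_null) + math.log(n)
    return math.exp(delta_bic / 2.0)

# Toy data: identical group means versus clearly separated groups.
same = bf01_bic([1, 2, 3, 4], [1, 2, 3, 4])
different = bf01_bic([1, 2, 3, 4], [11, 12, 13, 14])
print(same, different)
```

      When the two group means coincide exactly, the approximation reduces to the square root of the total sample size, which makes clear that the strength of evidence for the null grows slowly with n. Default-prior Bayes factors or equivalence tests are alternative approaches that may be preferable in practice.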

      (3) Face-specific or broader visual processing

      Prior work from the senior author's lab (Jiahui et al., 2018) reported pronounced reductions in scene selectivity and marginal reductions in body selectivity in DPs, suggesting that visual processing deficits in DPs may extend beyond faces. While the manuscript includes PPA as a high-level control region for scene perception, scene selectivity was not directly reported. The authors could also consider individual differences and potential data-quality confounds (tSNR difference between and within groups, several obvious outliers in the figures, etc). For instance, examining whether reduced tSNR in DPs contributed to lower face selectivity in the DP group in this dataset.

      Thank you for this suggestion - we will compare tSNR between the groups as a measure of data quality and we will include these comparisons. A preliminary look indicates that both groups possessed similar distributions of tSNR across many of the face-selective regions investigated here.

      (4) Linking pRF properties to behavior

      The manuscript aims to examine the relationship between pRF properties and behavior, but currently reports only one aspect of pRF (size) in relation to a single behavioral measure (CFMT), without full statistical reporting:

      "We found no significant association between participants' CFMT scores and mean pRF size in OFA, pFUS, or mFUS."

      For comprehensive reporting, the authors could examine additional pRF properties (e.g., center, eccentricity, scaling between eccentricity and pRF size, shape of visual field coverage, etc), additional ROIs (early, intermediate, and category-selective areas), and relate them to multiple behavioral measures (e.g., HEVA, PI20, FFT). This would provide a full picture of how pRF characteristics relate to behavioral performance in DP.

      We will report the full statistical values (r, p) for the (albeit non-significant) relationship between CFMT score and pRF size - thank you for bringing that to our attention. Additionally, we will add other analyses assessing the relationship between a wider array of pRF measures and the other behavioral tests administered to provide a more comprehensive picture of the relation between pRFs and behavior.

      References:

      Avidan, G., Tanzer, M., Hadj-Bouziane, F., Liu, N., Ungerleider, L. G., & Behrmann, M. (2014). Selective Dissociation Between Core and Extended Regions of the Face Processing Network in Congenital Prosopagnosia. Cerebral Cortex, 24(6), 1565–1578. https://doi.org/10.1093/cercor/bht007

      Furl, N., Garrido, L., Dolan, R. J., Driver, J., & Duchaine, B. (2011). Fusiform gyrus face selectivity relates to individual differences in facial recognition ability. Journal of Cognitive Neuroscience, 23(7), 1723–1740. https://doi.org/10.1162/jocn.2010.21545

      Gomez, J., Pestilli, F., Witthoft, N., Golarai, G., Liberman, A., Poltoratski, S., Yoon, J., & Grill-Spector, K. (2015). Functionally Defined White Matter Reveals Segregated Pathways in Human Ventral Temporal Cortex Associated with Category-Specific Processing. Neuron, 85(1), 216–227. https://doi.org/10.1016/j.neuron.2014.12.027

      Jednoróg, K., Marchewka, A., Altarelli, I., Monzalvo Lopez, A. K., van Ermingen-Marbach, M., Grande, M., Grabowska, A., Heim, S., & Ramus, F. (2015). How reliable are gray matter disruptions in specific reading disability across multiple countries and languages? Insights from a large-scale voxel-based morphometry study. Human Brain Mapping, 36(5), 1741–1754. https://doi.org/10.1002/hbm.22734

      Jiahui, G., Yang, H., & Duchaine, B. (2018). Developmental prosopagnosics have widespread selectivity reductions across category-selective visual cortex. Proceedings of the National Academy of Sciences of the United States of America, 115(28), E6418–E6427. https://doi.org/10.1073/pnas.1802246115

      Kay, K. N., Weiner, K. S., & Grill-Spector, K. (2015). Attention Reduces Spatial Uncertainty in Human Ventral Temporal Cortex. Current Biology, 25(5), 595–600. https://doi.org/10.1016/j.cub.2014.12.050

      Lohse, M., Garrido, L., Driver, J., Dolan, R. J., Duchaine, B. C., & Furl, N. (2016). Effective connectivity from early visual cortex to posterior occipitotemporal face areas supports face selectivity and predicts developmental prosopagnosia. Journal of Neuroscience, 36(13), 3821–3828. https://doi.org/10.1523/JNEUROSCI.3621-15.2016

      Norman-Haignere, S., Kanwisher, N., & McDermott, J. H. (2013). Cortical pitch regions in humans respond primarily to resolved harmonics and are located in specific tonotopic regions of anterior auditory cortex. Journal of Neuroscience, 33(50), 19451–19469. https://doi.org/10.1523/JNEUROSCI.2880-13.2013

      Poldrack, R. A., Baker, C. I., Durnez, J., Gorgolewski, K. J., Matthews, P. M., Munafò, M. R., Nichols, T. E., Poline, J. B., Vul, E., & Yarkoni, T. (2017). Scanning the horizon: Towards transparent and reproducible neuroimaging research. Nature Reviews Neuroscience, 18(2), 115–126. https://doi.org/10.1038/nrn.2016.167

      Ramus, F., Altarelli, I., Jednoróg, K., Zhao, J., & Scotto di Covella, L. (2018). Neuroanatomy of developmental dyslexia: Pitfalls and promise. Neuroscience and Biobehavioral Reviews, 84(July 2017), 434–452. https://doi.org/10.1016/j.neubiorev.2017.08.001

      Rosenthal, G., Tanzer, M., Simony, E., Hasson, U., Behrmann, M., & Avidan, G. (2017). Altered topology of neural circuits in congenital prosopagnosia. ELife, 6, 1–20. https://doi.org/10.7554/eLife.25069

      Song, S., Garrido, L., Nagy, Z., Mohammadi, S., Steel, A., Driver, J., Dolan, R. J., Duchaine, B., & Furl, N. (2015). Local but not long-range microstructural differences of the ventral temporal cortex in developmental prosopagnosia. Neuropsychologia, 78, 195–206. https://doi.org/10.1016/j.neuropsychologia.2015.10.010

      Turner, B. O., Paul, E. J., Miller, M. B., & Barbey, A. K. (2018). Small sample sizes reduce the replicability of task-based fMRI studies. Communications Biology, 1(1). https://doi.org/10.1038/s42003-018-0073-z

      Witthoft, N., Poltoratski, S., Nguyen, M., Golarai, G., Liberman, A., LaRocque, K., Smith, M., & Grill-Spector, K. (2016). Reduced spatial integration in the ventral visual cortex underlies face recognition deficits in developmental prosopagnosia. BioRxiv, 1–26.

    1. eLife Assessment

      This manuscript makes a valuable contribution to understanding learning in multidimensional environments with spurious associations, which is critical for understanding learning in the real world. The evidence is based on model simulations and a preregistered human behavioral study, but remains incomplete because of inconclusive empirical results and insufficiencies in the modeling. Moreover, there are open questions about the nature and extent to which the behavioral task induced semantic congruency.

    2. Reviewer #1 (Public review):

      Summary:

      This paper reports model simulations and a human behavioral experiment studying predictive learning in a multidimensional environment. The authors claim that semantic biases help people resolve ambiguity about predictive relationships due to spurious correlations.

      Strengths:

      (1) The general question addressed by the paper is important.

      (2) The paper is clearly written.

      (3) Experiments and analyses are rigorously executed.

      Weaknesses:

      (1) Showing that people can be misled by spurious correlations, and that they can overcome this to some extent by using semantic structure, is not especially surprising to me. Related literature already exists on illusory correlation, illusory causation, superstitious behavior, and inductive biases in causal structure learning. None of this work features in the paper, which is rather narrowly focused on a particular class of predictive representations, which, in fact, may not be particularly relevant for this experiment. I also feel that the paper is rather long and complex for what is ultimately a simple point based on a single experiment.

      (2) Putting myself in the shoes of an experimental subject, I struggled to understand the nature of semantic congruency. I don't understand why the expectation that the builder and terminal robots should have similar features is considered a natural semantic inductive bias. Humans build things all the time that look different from them, and we build machines that construct artifacts that look different from the machines. I think the fact that the manipulation worked attests to the ability of human subjects to pick up on patterns rather than supporting the idea that this reflects an inductive bias they brought to the experiment.

      (3) As the authors note, because the experiment uses only a single transition, it's not clear that it can really test the distinctive aspects of the SR/SF framework, which come into play over longer horizons. So I'm not really sure to what extent this paper is fundamentally about SFs, as it's currently advertised.

      (4) One issue with the inductive bias as defined in Equation 15 is that I don't think it will converge to the correct SR matrix. Thus, the bias is not just affecting the learning dynamics, but also the asymptotic value (if there even is one; that's not clear either). As an empirical model, this isn't necessarily wrong, but it does mess with the interpretation of the estimator. We're now talking about a different object from the SR.

      (5) Some aspects of the empirical and model-based results only provide weak support for the proposed model. The following null effects don't agree with the predictions of the model:

      (a) No effect of condition on reward.

      (b) No effect of condition on composition spurious predictiveness.

      (c) No effect of condition on the fitted bias parameter.

      The authors present some additional exploratory analyses that they use to support their claims, but this should be considered weaker support than the results of preregistered analyses.

      (6) I appreciate that the authors were transparent about which predictions weren't confirmed. I don't think they're necessarily deal-breakers for the paper's claims. However, these caveats don't show up anywhere in the Discussion.

      (7) I also worry that the study might have been underpowered to detect some of these effects. The preregistration doesn't describe any pilot data that could be used to estimate effect sizes, and it doesn't present any power analysis to support the chosen sample sizes, which I think are on the small side for this kind of study.
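
      To make the power concern in point (7) concrete, the following is a minimal sketch of an a priori sample-size calculation. It assumes a two-sided, two-group comparison of means and uses the standard normal approximation; the function name `n_per_group` and the effect sizes (Cohen's d of 0.2, 0.5, and 0.8) are illustrative assumptions, not values taken from the manuscript or its preregistration, and the study's actual design may call for a different calculation.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate participants needed per group for a two-sample comparison.

    Normal approximation for a two-sided test:
        n = 2 * ((z_{1 - alpha/2} + z_{power}) / d) ** 2
    where d is the standardized mean difference (Cohen's d).
    This slightly underestimates the exact t-test requirement.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    # Hypothetical effect sizes (conventional small / medium / large benchmarks).
    for d in (0.2, 0.5, 0.8):
        print(f"d = {d}: about {n_per_group(d)} participants per group")
```

      Under these assumptions the required sample grows with the inverse square of the effect size, which is why an effect-size estimate from pilot data matters so much when judging whether a chosen sample is adequate.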

    3. Reviewer #2 (Public review):

      Summary:

      This work by Prentis and Bakkour examines how predictive memory can become distorted in multidimensional environments and how inductive biases may mitigate these distortions. Using both computational simulations and an original human-robot building task with manipulated semantic congruency, the authors show that spurious observations can amplify noise throughout memory. They hypothesize, and preliminarily support, that humans deploy inductive biases to suppress such spurious information.

      Strengths:

      (1) The manuscript addresses an interesting and understudied question: specifically, how learning is distorted by spurious observations in high-dimensional settings.

      (2) The theoretical modeling and feature-based successor representation analyses are methodologically sound, and simulations illustrate expected memory distortions due to multidimensional transitions.

      (3) The behavioral experiment introduces a creative robot-building paradigm and manipulates transitions to test the effect of semantic congruency (arguably closer to category or part congruency, as explained below).

      Weaknesses:

      (1) The semantic manipulation may be more about category congruence (e.g., body part function) than semantic meaning. The robot-building task seems to hinge on categorical/functional relationships rather than semantic abstraction. Strong evidence for semantic learning would require richer, more genuinely semantic manipulations.

      (2) The experimental design remains limited in dimensionality and depth. Simulated higher-dimensional or deeper tasks (or empirical follow-up) would strengthen the interpretation and relevance for real-world memory distortion.

      (3) The identification of idiosyncratic biases appears to reflect individual variation in categorical mapping rather than semantic processing. The lack of conjunctive learning may simply reflect variability in assumed builder-target mappings, not a principled semantic effect.

      Additional Comments:

      (1) It is unclear whether this task primarily probes memory or reinforcement learning, since the graded reward feedback in the current design closely aligns with typical reinforcement learning paradigms.

      (2) It may be unsurprising that the feature-based successor model fits best given the task structure, so broader model comparisons are encouraged.

      (3) Simulation-only work on higher dimensionality (lines 514-515) falls short; an empirical follow-up would greatly enhance the claims.